ABSTRACT

An answer to a query has a well-defined lineage expression (alter- 
natively called how-provenance) that explains how the answer was 
derived. Recent work has also shown how to compute the lineage 
of a non-answer to a query. However, the cause of an answer or 
non-answer is a more subtle notion and consists, in general, of only 
a fragment of the lineage. In this paper, we adapt Halpern, Pearl, 
and Chockler's recent definitions of causality and responsibility to 
define the causes of answers and non-answers to queries, and their 
degree of responsibility. Responsibility captures the notion of de- 
gree of causality and serves to rank potentially many causes by their 
relative contributions to the effect. Then, we study the complexity 
of computing causes and responsibilities for conjunctive queries. It 
is known that computing causes is NP-complete in general. Our 
first main result shows that all causes to conjunctive queries can 
be computed by a relational query which may involve negation. 
Thus, causality can be computed in PTIME, and very efficiently so. 
Next, we study computing responsibility. Here, we prove that the 
complexity depends on the conjunctive query and demonstrate a di- 
chotomy between PTIME and NP-complete cases. For the PTIME 
cases, we give a non-trivial algorithm, consisting of a reduction to 
the max-flow computation problem. Finally, we prove that, even
when it is in PTIME, responsibility is complete for LOGSPACE,
implying that, unlike causality, it cannot be computed by a relational
query.

1. INTRODUCTION 

When analyzing complex data sets, users are often interested in 
the reasons for surprising observations. In a database context, they 
would like to find the causes of answers or non-answers to their 
queries. For example, "What caused my personalized newscast to 
have more than 50 items today?" Or, "What caused my favorite un- 
dergrad student to not appear on the Dean's list this year?" Philoso- 
phers have debated for centuries various notions of causality, and 
today it is still studied in philosophy, AI, and cognitive science. 
Understanding causality in a broad sense is of vital practical
importance, for example in determining legal responsibility in
multi-car accidents, in diagnosing malfunction of complex systems, or
in scientific inquiry. A formal, mathematical study of causality was
initiated by the recent work of Halpern and Pearl [13] and Chockler
and Halpern [5], who gave mathematical definitions of causality
and its related notion of degree of responsibility. These formal
definitions have led to applications in knowledge representation and
model checking [9, 10, 5]. In this paper, we adapt the notions of
causality and responsibility to database queries, and study the
complexity of computing the causes and their responsibilities for
answers and non-answers to conjunctive queries.

Figure 1: A SQL query returning the genres of all movies directed
by Burton, on the IMDB dataset (www.imdb.org). The famous director
Tim Burton is known for dark, gothic themes, so the genres Fantasy
and Horror are expected. But the genres Music and Musical are quite
surprising. The goal of this paper is to find the causes for
surprising query results.

Example 1.1 (IMDB). Tim Burton is an Oscar nominated 
director whose movies often include fantasy elements and dark, 
gothic themes. Examples of his work are "Edward Scissorhands", 
"Beetlejuice" and the recent "Alice in Wonderland". A user wishes 
to learn more about Burton's movies and queries the IMDB dataset
to find out all genres of movies that he has directed (see Fig. 1).
Fantasy and Horror are quite expected categories. But Music and
Musical are surprising. The user wishes to know the reason for
these answers. Examining the lineage of a surprising answer is a
first step towards finding its reason, but it is not sufficient: the
combined lineage of the two categories consists of a total of 137 base
tuples, which is overwhelming to the user.

Causality is related to provenance, yet it is a more refined no- 
tion: Causality can answer questions like the one in our example 
by returning the causes of query results ranked by their degree of 
responsibility. Our starting point is Halpern and Pearl's definition
of causality [13], from which we borrow three important concepts:

(1) Partitioning of variables into exogenous and endogenous:
Exogenous variables define a context determined by external,
unconcerned factors, deemed not to be possible causes, while endogenous
variables are the ones judged to affect the outcome and are thus
potential causes. In a database setting, variables are tuples in the
database, and the first step is to partition them into exogenous and
endogenous. For example, we may consider Director and Movie
tuples as endogenous and all others as exogenous. The classification
into endogenous/exogenous is application-dependent, and may
even be chosen by the user at query time. For example, if erroneous
data in the directors table is suspected, then only Director may
be declared endogenous; alternatively, the user may choose only
Movie tuples with year>2008 to be endogenous, for example in
order to find recent or under-production movies that may explain
the surprising outputs to the query. Thus, the partition into
endogenous and exogenous tuples is not restricted to entire relations.
As a default, the user may start by declaring all tuples in the
database as endogenous, then narrow down.

Figure 2: Lineage (a) and causes with their responsibilities (b) for
the Musical tuple in Example 1.1.

(2) Contingencies: an endogenous tuple t is a cause for the ob- 
served outcome only if there is a hypothetical setting of the other 
endogenous variables under which the addition/removal of t causes 
the observed outcome to change. Therefore, in order to check that 
a tuple t is a cause for a query answer, one has to find a set of en- 
dogenous tuples (called contingency) to remove from (or add to) 
the database, such that the tuple t immediately affects the answer 
in the new state of the database. In theory, in order to compute
the contingency one has to iterate over subsets of endogenous tuples.
Not surprisingly, checking causality is NP-complete in general [9].
However, the first main result in this paper is to show that
the causality of conjunctive queries can be determined in PTIME, 
and furthermore, all causes can be computed by a relational query. 

(3) Responsibility, a notion first defined in [5], measures the
degree of causality as a function of the size of the smallest
contingency set. In applications involving large datasets, it is
critical to rank the candidate causes by their responsibility, because
answers to complex queries may have large lineages and large numbers
of candidate causes. In theory, in order to compute the responsibility
one has to iterate over all contingency sets: not surprisingly,
computing responsibility in general is hard for
$\mathrm{FP}^{\Sigma_2^p[\log n]}$ [5].^1 However, our second main
result, and at the same time the strongest result of this paper, is a
dichotomy result for conjunctive queries: for each query without
self-joins, either its responsibility can be computed in PTIME in the
size of the database (using a non-obvious algorithm), or checking if
it has a responsibility below a given value is NP-hard.



Example 1.2 (IMDB continued). Continuing Example 1.1,
we show in Fig. 2b the causes for Musical ranked by their
responsibility score. (We explain in Sect. 2 how these scores are
computed.) At the top of the list is the movie "Sweeney Todd", which
is, indeed, the one and single musical movie directed by Tim Burton.
Thus, this tuple represents a surprising fact in the data of great
interest to the user. The next three tuples in the list are directors
whose last name is Burton. These tuples too are of high interest to
the user because they indicate that the query was ambiguous. Equally
interesting is to look at the bottom of the ranked list. The movie
"Manon Lescaut" is made by Humphrey Burton, a far less known director
specialized in musicals. Clearly, the movie itself is not an
interesting explanation to the user; the interesting explanation is the
director, showing that he happens to have the same last name, and
indeed, the director is ranked higher while the movie is (correctly)
ranked lower. In our simple example Musical has a small lineage,
consisting of only ten tuples. More typically, the lineage can be
much larger (Music has a lineage with 127 tuples), and it is critical
to rank the potential causes by their degree of responsibility.

^1 This is the class of functions computable by a poly-time Turing
machine which makes log n queries to a $\Sigma_2^p$ oracle.

We start by adapting the Halpern and Pearl definition of causality 
(HP from now on) to database queries, based on contingency sets. 
We define causality and responsibility both for Why-So queries 
("why did the query return this answer?") and for Why-No queries 
("why did the query not return this answer?"). We then prove two 
fundamental results. First, we show that computing the causes to 
any conjunctive query can be done in PTIME in the size of the 
database, i.e. query causality has PTIME data complexity; by
contrast, causality of arbitrary Boolean expressions is NP-complete
[9]. In fact we prove something stronger: the set of all causes can be
retrieved by a query expressed in First Order Logic (FO). This has 
important practical consequences, because it means that one can re- 
trieve all causes to a conjunctive query by simply running a certain 
SQL query. In general, the latter cannot be a conjunctive query, but 
must have one level of negation. However, we show that if the user 
query has no self joins and every table is either entirely endogenous 
or entirely exogenous, then the Why-So causes can be retrieved by 
some conjunctive query. These results are summarized in Fig. 3.

Second, we give a dichotomy theorem for query responsibility.
This is our strongest technical result in this paper. For every
conjunctive query without self-joins, one of the following holds:
either the responsibility can be computed in PTIME or it is provably
NP-hard. In the first case, we give a quite non-obvious algorithm for
computing the degrees of responsibility using Ford-Fulkerson's
max-flow algorithm. We further show that one can distinguish between
the two cases by checking a property of the query expression that
we call linearity. We also discuss conjunctive queries with
self-joins, and finally show that, in the case of Why-No causality, one
can always compute responsibility in PTIME. These results are also
summarized in Fig. 3.

Causality and provenance: Causality is related to lineage of
query results, such as why-provenance [7] or where-provenance [2].
Recently, even explanations for non-answers have been described
in terms of lineage [15, 3]. We make use of this prior work because
the first step in computing causes and responsibilities is to
determine the lineage of an answer or non-answer to a query. We note,
however, that computing the lineage of an answer is only the first
step, and is not sufficient for determining causality: causality needs
to be established through a contingency set, and is also accompanied
by a degree (the responsibility), which are both more difficult
to compute than the lineage.

Figure 3: Complexity of determining causality and responsibility
for conjunctive queries. For queries with no self-joins we
provide a complete dichotomy result. Queries with self-joins
are NP-hard in general, but a similar dichotomy is not known.

Contributions and outline. Our three main contributions are: 

• We define Why-So and Why-No causality and responsibility
for conjunctive database queries (Sect. 2).

• We prove that causality has PTIME data complexity for
conjunctive queries (Sect. 3).

• We prove a dichotomy theorem for responsibility and
conjunctive queries (Sect. 4).

We review related work (Sect. 5) before we conclude (Sect. 6). All
proofs are provided in the Appendix.

2. QUERY CAUSE AND RESPONSIBILITY 

We assume a standard relational schema with relation names
$R_1, \ldots, R_k$. We write $D$ for a database instance and $q$ for a
query. We consider only conjunctive queries, unless otherwise stated.
A subset of tuples $D^n \subseteq D$ represents endogenous tuples; the
complement $D^x = D - D^n$ is called the set of exogenous tuples. For
each relation $R_i$ we write $R_i^n$ and $R_i^x$ to denote the
endogenous and exogenous tuples in $R_i$ respectively. If $a$ is a
tuple with the same arity as the query's answer, then we write
$D \models q(a)$ when $a$ is an answer to $q$ on $D$, and write
$D \not\models q(a)$ when $a$ is a non-answer to $q$ on $D$.

Definition 2.1 (Causality). Let $t \in D^n$ be an endogenous
tuple, and $a$ a possible answer for $q$.

• $t$ is called a counterfactual cause for $a$ in $D$ if
$D \models q(a)$ and $D - \{t\} \not\models q(a)$.

• $t \in D^n$ is called an actual cause for $a$ if there exists a set
$\Gamma \subseteq D^n$, called a contingency for $t$, such that $t$ is
a counterfactual cause for $a$ in $D - \Gamma$.

A tuple $t$ is a counterfactual cause if, by removing it from the
database, we remove $a$ from the answer. The tuple is an actual
cause if one can find a contingency under which it becomes a
counterfactual cause: more precisely, one has to find a set $\Gamma$
such that, after removing $\Gamma$ from the database, we bring it to a
state where removing/inserting $t$ causes $a$ to switch between an
answer and a non-answer. Obviously, every counterfactual cause is also
an actual cause, by taking $\Gamma = \emptyset$. The definition of
causality extends naturally to the case when the query $q$ is Boolean:
in that case, a counterfactual cause is a tuple that, when removed,
determines $q$ to become false.

Example 2.2. Consider the query $q(x)$ :- $R(x, y), S(y)$ on
the following database instance, and assume all tuples are
endogenous: $R = R^n$, $S = S^n$. Consider the answer $a_2$. The tuple
$S(a_1)$ is a counterfactual cause for this result, because if we
remove this tuple from $S$ then $a_2$ is no longer an answer. Now
consider the answer $a_4$. Tuple $S(a_3)$ is not a counterfactual
cause: if we remove it from $S$, $a_4$ is still an answer. But $S(a_3)$
is an actual cause with contingency $\{S(a_2)\}$: once we remove
$S(a_2)$ we reach a state where $a_4$ is still an answer, but further
removing $S(a_3)$ makes $a_4$ a non-answer.

[Figure: database instance with relations R and S for the query
q(x) :- R(x, y), S(y)]
For a more subtle example, consider the Boolean query $q$ :-
$R(x, a_3), S(a_3)$ (where $a_3$ is a constant), which is true on the
given instance. Suppose only the first three tuples in $R$ are
endogenous, and the last two are exogenous:
$R^x = \{(a_4, a_3), (a_4, a_2)\}$. Let's examine whether
$R^n(a_3, a_3)$ is a cause for the query being true. This tuple is not
an actual cause. This is because $\{S^n(a_3)\}$ is not a contingency
for $R^n(a_3, a_3)$: by removing $S^n(a_3)$ from the database we make
the query false; in other words, the tuple $R^n(a_3, a_3)$ makes no
difference, under any contingency. Notice that $\{R^x(a_4, a_3)\}$ is
not a contingency because $R^x(a_4, a_3)$ is exogenous.
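Definition 2.1 can be checked by brute force on small instances. The following Python sketch is only an illustration, not the method of this paper: it enumerates candidate contingencies in order of size. The database instance in the code is an assumption, chosen to agree with the statements made in Example 2.2 (the original figure is not reproduced here), and all function names are hypothetical.

```python
from itertools import combinations

# Assumed instance (hypothetical): facts are ("R", x, y) or ("S", y).
DB = frozenset({
    ("R", "a1", "a5"), ("R", "a2", "a1"), ("R", "a3", "a3"),
    ("R", "a4", "a3"), ("R", "a4", "a2"),
    ("S", "a1"), ("S", "a2"), ("S", "a3"), ("S", "a4"), ("S", "a6"),
})
# For the Boolean example, the last two R-tuples are exogenous.
ENDO = DB - {("R", "a4", "a3"), ("R", "a4", "a2")}


def q_answers(db):
    """Answers of q(x) :- R(x, y), S(y)."""
    s_vals = {f[1] for f in db if f[0] == "S"}
    return {f[1] for f in db if f[0] == "R" and f[2] in s_vals}


def q_bool(db):
    """Boolean query q :- R(x, a3), S(a3); returns {True} when it holds."""
    holds = ("S", "a3") in db and any(
        f[0] == "R" and f[2] == "a3" for f in db)
    return {True} if holds else set()


def is_counterfactual(t, a, db, query=q_answers):
    """a is an answer on db and stops being one when t is removed."""
    return a in query(db) and a not in query(db - {t})


def find_contingency(t, a, db, endogenous, query=q_answers):
    """Smallest set of endogenous tuples whose removal makes t
    counterfactual for a, or None if t is not an actual cause.
    Exponential in the number of endogenous tuples."""
    candidates = sorted(endogenous - {t})
    for k in range(len(candidates) + 1):
        for gamma in combinations(candidates, k):
            if is_counterfactual(t, a, db - set(gamma), query):
                return set(gamma)
    return None


# S(a1) is counterfactual for a2 (empty contingency).
print(find_contingency(("S", "a1"), "a2", DB, DB))
# S(a3) needs a contingency of size one for a4.
print(find_contingency(("S", "a3"), "a4", DB, DB))
# R(a3, a3) is not an actual cause of the Boolean query.
print(find_contingency(("R", "a3", "a3"), True, DB, ENDO, q_bool))
```

The search returns the first smallest contingency it finds, which need not be the one named in the text; any contingency of the same size witnesses the same fact.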

In this paper we discuss two instantiations of query causality.
In the first, called Why-So causality, we are given an actual
answer $a$ to the query, and would like to find the cause(s) for this
answer. Definition 2.1 is given for Why-So causality. In this case
$D$ is the real database, and the endogenous tuples $D^n$ are a given
subset, while the exogenous tuples are $D^x = D - D^n$. In the second
instantiation, called Why-No causality, we are given a non-answer $a$
to the query, and would like to know the cause why $a$ is not an
answer. This requires some minor changes to Definition 2.1. Now the
real database consists entirely of exogenous tuples, $D^x$. In
addition, we are given a set of potentially missing tuples, whose
absence from the database caused $a$ to be a non-answer: these form
the endogenous tuples, $D^n$, and we denote $D = D^x \cup D^n$. We do
not discuss in this paper how to compute $D^n$: this has been
addressed in recent work [15]. In this setting, the definition of
Why-No causality is the dual of Def. 2.1 and we give it here briefly:
a counterfactual cause for the non-answer $a$ in $D^x$ is a tuple
$t \in D^n$ s.t. $D^x \not\models q(a)$ and
$D^x \cup \{t\} \models q(a)$; an actual cause for the non-answer $a$
is a tuple $t \in D^n$ s.t. there exists a set $\Gamma \subseteq D^n$,
called a contingency set, s.t. $t$ is a counterfactual cause for the
non-answer of $a$ in $D^x \cup \Gamma$.

We now define responsibility, measuring the degree of causality. 

Definition 2.3 (Responsibility). Let $a$ be an answer or
non-answer to a query $q$, and let $t$ be a cause (either Why-So, or
Why-No cause). The responsibility of $t$ for the (non-)answer $a$ is:

$$\rho_t = \frac{1}{1 + \min_{\Gamma} |\Gamma|}$$

where $\Gamma$ ranges over all contingency sets for $t$.

Thus, the responsibility is a function of the minimal number of
tuples that we need to remove from the real database $D$ (in the case
of Why-So), or that we need to add to the real database $D^x$ (in the
case of Why-No) before $t$ becomes counterfactual. The tuple $t$ is a
counterfactual cause iff $\rho_t = 1$, and it is an actual cause iff
$\rho_t > 0$. By convention, if $t$ is not a cause, $\rho_t = 0$.
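To make the Why-No reading of Definition 2.3 concrete, the sketch below computes responsibilities by exhaustive search over sets of missing tuples to add. It is a minimal illustration under stated assumptions: the Boolean query, the instance (EXO, ENDO), and the function names are all hypothetical and do not come from the paper.

```python
from fractions import Fraction
from itertools import combinations


def q(db):
    """Boolean query q :- R(x, y), S(y) over facts ("R", x, y), ("S", y)."""
    s_vals = {f[1] for f in db if f[0] == "S"}
    return any(f[0] == "R" and f[2] in s_vals for f in db)


def why_no_responsibility(t, exo, endo):
    """Responsibility 1 / (1 + |Gamma|) of a missing tuple t, where Gamma
    is a smallest set of other missing tuples such that q is still false
    on exo + Gamma but true on exo + Gamma + {t}. Returns 0 if none exists."""
    candidates = sorted(endo - {t})
    for k in range(len(candidates) + 1):
        for gamma in combinations(candidates, k):
            state = exo | set(gamma)
            if not q(state) and q(state | {t}):
                return Fraction(1, 1 + k)
    return Fraction(0)


# Hypothetical instance: the real database holds one R-tuple only.
EXO = {("R", "a", "b")}
# Potentially missing tuples (assumed to be given).
ENDO = {("S", "b"), ("R", "c", "d"), ("S", "d"), ("R", "e", "f")}

for t in sorted(ENDO):
    print(t, why_no_responsibility(t, EXO, ENDO))
```

Here S(b) alone turns the non-answer into an answer, so it is counterfactual; R(c, d) and S(d) each need the other to be inserted first; R(e, f) can never contribute because no S(f) is among the candidates.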






Example 2.4 (IMDB continued). Figure 2a shows the lineage
of the answer Musical in Example 1.1. Consider the movie
"Sweeney Todd": its responsibility is 1/3 because the smallest
contingency is: {Director(David, Burton), Director(Humphrey,
Burton)} (if we remove both directors, then "Sweeney Todd" becomes
counterfactual). Consider now the movie "Manon Lescaut":
its responsibility is 1/5 because the smallest contingency set is
{Director(David, Burton), Movie("Flight"), Movie("Candide"),
Director(Tim, Burton)}.

We now define formally the problems studied in this paper. Let
$D = D^x \cup D^n$ be a database consisting of endogenous and
exogenous tuples, $q$ be a query, and $a$ be a potential answer to the
query.

Causality problem: Compute the set $C \subseteq D^n$ of actual causes
for the answer $a$.

Responsibility problem: For each actual cause $t \in C$, compute its
responsibility $\rho_t$.

We study the data complexity in this paper: the query $q$ is fixed,
and the complexity is a function of the size of the database instance
$D$. In the rest of the paper we restrict our discussion w.l.o.g.
to Boolean queries: if $q(x)$ is not Boolean, then to compute the
causes or responsibilities for an answer $a$ it suffices to compute the
causes or responsibilities of the Boolean query $q[a/x]$, where all
head variables are substituted with the constants in $a$.

3. COMPLEXITY OF CAUSALITY 

We start by proving that causality can be computed efficiently;
even stronger, we show that causes can be computed by a relational
query. This is in contrast with the general causality problem,
where Eiter [9] has shown that deciding causality for a Boolean
expression is NP-complete. We obtain tractability by restricting our
queries to conjunctive queries. Chockler et al. [6] have shown that
causality for "read once" Boolean circuits is in PTIME. Our results
are strictly stronger: for the case of conjunctive queries without
self-joins, queries with read-once lineage expressions are precisely
the hierarchical queries [8, 20], while our results apply to all
conjunctive queries. The results in this section apply uniformly to
both Why-So and Why-No causality, so we will simply refer to causality
without specifying which kind. Also, we restrict our discussion to
Boolean queries only.

We write positive Boolean expressions in DNF, like
$\Phi = (X_1 \wedge X_3) \vee (X_1 \wedge X_2 \wedge X_3) \vee (X_1 \wedge X_4)$;
sometimes we drop $\wedge$, and write
$\Phi = X_1X_3 \vee X_1X_2X_3 \vee X_1X_4$. A conjunct $c$ is redundant
if there exists another conjunct $c'$ that is a strict subset of $c$.
Redundant conjuncts can be removed without affecting the Boolean
expression. In our example, $X_1X_2X_3$ is redundant, because it
strictly contains $X_1X_3$; it can be removed and $\Phi$ simplifies to
$X_1X_3 \vee X_1X_4$. A positive DNF is satisfiable if it has at least
one conjunct; otherwise it is equivalent to false and we call it
unsatisfiable.

Next, we review the definition of lineage. Fix a Boolean conjunctive
query consisting of $m$ atoms, $q = g_1, \ldots, g_m$, and a database
instance $D$; recall that $D = D^x \cup D^n$ (exogenous and endogenous
tuples). For every tuple $t \in D$, let $X_t$ denote a distinct Boolean
variable associated to that tuple. A valuation for $q$ is a mapping
$\theta : Var(q) \to Adom(D)$, where $Adom(D)$ is the active domain of
the database, such that the instantiation of every atom is a tuple in
the database: $t_i = \theta(g_i) \in D$ for $i = 1, \ldots, m$. We
associate to the valuation $\theta$ the following conjunct:
$c_\theta = X_{t_1} \wedge \ldots \wedge X_{t_m}$. The lineage of $q$
is:

$$\Phi = \bigvee_{\theta} c_\theta$$

We will assume w.l.o.g. that $D^x \not\models q$ and
$(D^x \cup D^n) \models q$ (otherwise we have no causes).

Definition 3.1 (n-Lineage). The n-lineage of $q$ is:

$$\Phi^n = \Phi[X_t := \text{true}, \forall t \in D^x]$$

Here $\Phi[X_t := \text{true}, \forall t \in D^x]$ means substituting
$X_t$ with true, for all Boolean variables $X_t$ corresponding to
exogenous tuples $t$. Thus, the n-lineage is obtained as follows.
Compute the standard lineage, over all tuples (exogenous and
endogenous), then set to true all exogenous tuples: the remaining
expression depends only on endogenous tuples. The following technical
result allows us to compute the causes to answers of conjunctive
queries.

Theorem 3.2 (Causality). Let $q$ be a conjunctive query,
and $t$ be an endogenous tuple. Then the following three conditions
are equivalent:

1. $t$ is an actual cause for $q$ (Def. 2.1).

2. There exists a set of tuples $\Gamma \subseteq D^n$ such that the
lineage $\Phi[X_u := \text{false}, \forall u \in \Gamma]$ is
satisfiable, and
$\Phi[X_u := \text{false}, \forall u \in \Gamma; X_t := \text{false}]$
is unsatisfiable.

3. There exists a non-redundant conjunct in the n-lineage $\Phi^n$
that contains the variable $X_t$.

We give the proof in the Appendix. The theorem gives a PTIME
algorithm for computing all causes of $q$: compute the n-lineage
$\Phi^n$ as described above, and remove all redundant conjuncts. All
tuples that still occur in the lineage are actual causes of $q$.
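The procedure suggested by condition 3 can be sketched in a few lines of Python. This is an illustration under assumptions rather than the paper's implementation: a DNF lineage is represented as a collection of sets of tuple identifiers (one set per conjunct), and the function names are hypothetical.

```python
def n_lineage(lineage, exogenous):
    """Set every exogenous variable to true, i.e. drop it from each conjunct."""
    return {frozenset(set(c) - set(exogenous)) for c in lineage}


def remove_redundant(conjuncts):
    """Keep only conjuncts that have no strict subset among the others."""
    return {c for c in conjuncts
            if not any(other < c for other in conjuncts)}


def actual_causes(lineage, exogenous):
    """Tuples occurring in some non-redundant conjunct of the n-lineage."""
    minimal = remove_redundant(n_lineage(lineage, exogenous))
    return set().union(*minimal) if minimal else set()


# Lineage with two conjuncts sharing the tuple S(a3).
LINEAGE = [{"R(a3,a3)", "S(a3)"}, {"R(a4,a3)", "S(a3)"}]

# With R(a4,a3) exogenous, its conjunct shrinks to {S(a3)} and makes
# the other conjunct redundant.
print(actual_causes(LINEAGE, {"R(a4,a3)"}))
# With every tuple endogenous, no conjunct is redundant.
print(actual_causes(LINEAGE, set()))
```

If some conjunct consists of exogenous tuples only, it becomes the empty conjunct, every other conjunct is redundant, and the sketch returns no causes; this matches the assumption that the exogenous tuples alone do not satisfy the query.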

Example 3.3. Consider $q$ :- $R(x, y), S(y), y = a_3$ over the
database of Example 2.2. Its lineage is
$\Phi = X_{R(a_3,a_3)} X_{S(a_3)} \vee X_{R(a_4,a_3)} X_{S(a_3)}$.
Assume $R(a_4, a_3)$ is exogenous and $R(a_3, a_3)$, $S(a_3)$ are
endogenous. Then the n-lineage is obtained by setting
$X_{R(a_4,a_3)} = \text{true}$:
$\Phi^n = X_{R(a_3,a_3)} X_{S(a_3)} \vee X_{S(a_3)}$. After removing
the redundant conjunct, the n-lineage becomes $\Phi^n = X_{S(a_3)}$;
hence, $S(a_3)$ is the only actual cause for the query.

In the rest of this section we prove a stronger result. Denote by
$C_R$ the set of actual causes in the relation $R$; that is,
$C_R \subseteq R^n$, and every tuple $t \in C_R$ is an actual cause. We
show that $C_R$ can be computed by a relational query. In particular,
this means that the causes to a (non-)answer can be computed by a SQL
query, and therefore the computation can be performed entirely in the
database system.

Theorem 3.4 (Causality FO). Given a Boolean query $q$
over relations $R_1, \ldots, R_k$, the set of all causes of $q$,
$\{C_{R_1}, \ldots, C_{R_k}\}$, can be expressed in non-recursive
stratified Datalog with negation, with only two strata.

Theorem 3.4 shows that causes can be expressed in a language
equivalent to a subset of first order logic [1] and that, moreover,
only one level of negation is needed. The proof is in the Appendix.

Example 3.5. Continuing with the query $q$ :- $R(x, y), S(y)$
from Example 3.3, suppose all tuples in $S$ are endogenous. Thus,
we have $R^n$, $R^x$, $S^n$, but $S^x = \emptyset$. The complete
Datalog program that produces the causes for $q$ is:

    I(y)      :- R^x(x, y), S^n(y)
    C_R(x, y) :- R^n(x, y), S^n(y), ¬I(y)
    C_S(y)    :- R^n(x, y), S^n(y), ¬I(y)
    C_S(y)    :- R^x(x, y), S^n(y)
The role of $\neg I(y)$ is to remove redundant terms from the lineage.
To see this, consider the database $R = \{(a_4, a_3), (a_3, a_3)\}$,
$S = S^n = \{a_3\}$, and assume that $R^n = \{(a_3, a_3)\}$,
$R^x = \{(a_4, a_3)\}$; thus, $q$'s lineage and n-lineage are:

$$\Phi = X_{R(a_4,a_3)} X_{S(a_3)} \vee X_{R(a_3,a_3)} X_{S(a_3)}$$
$$\Phi^n = X_{S(a_3)} \vee X_{R(a_3,a_3)} X_{S(a_3)} = X_{S(a_3)}$$

Thus, the only actual cause of $q$ is $S(a_3)$. Consider $C_R$, which
computes causes in $R$. Without the negated term $\neg I(y)$, $C_R$
would return $R(a_3, a_3)$ (which would be incorrect). The role of the
negated term $\neg I(y)$ is to remove the redundant terms in $\Phi^n$:
in our example, $C_R$ returns the empty set (which is correct).
Similarly, one can check that $C_S$ returns $S(a_3)$. Note that
negation is necessary in $C_R$ because it is non-monotone: if we remove
the tuple $R(a_4, a_3)$ from the database then $R(a_3, a_3)$ becomes a
cause for the query $q$, thus $C_R$ is non-monotone. Hence, in general,
we must use negation in order to compute causes.
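The effect of the negated term can be reproduced with ordinary set comprehensions. The sketch below is an illustration only: it reads the four rules of Example 3.5 as I(y) :- R^x(x, y), S^n(y); C_R(x, y) :- R^n(x, y), S^n(y), ¬I(y); C_S(y) :- R^n(x, y), S^n(y), ¬I(y); and C_S(y) :- R^x(x, y), S^n(y). The function name and the Python encoding of relations as sets are assumptions.

```python
def causes(r_n, r_x, s_n):
    """Evaluate the two-stratum program for q :- R(x, y), S(y).

    r_n, r_x: endogenous / exogenous R-tuples as sets of (x, y) pairs.
    s_n: endogenous S-tuples as a set of values (S has no exogenous part).
    """
    # Stratum 1: I(y) :- R^x(x, y), S^n(y)
    i = {y for (_, y) in r_x if y in s_n}
    # Stratum 2: C_R(x, y) :- R^n(x, y), S^n(y), not I(y)
    c_r = {(x, y) for (x, y) in r_n if y in s_n and y not in i}
    # C_S(y) :- R^n(x, y), S^n(y), not I(y)
    # C_S(y) :- R^x(x, y), S^n(y)   (same body as I, so it contributes i)
    c_s = {y for (_, y) in r_n if y in s_n and y not in i} | i
    return c_r, c_s


# Instance from the discussion above.
print(causes({("a3", "a3")}, {("a4", "a3")}, {"a3"}))
# Removing the exogenous tuple R(a4, a3) makes R(a3, a3) a cause,
# which shows that C_R is non-monotone.
print(causes({("a3", "a3")}, set(), {"a3"}))
```

Since removing a tuple from the input adds a tuple to the output of C_R, no negation-free (monotone) query can compute it.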

Example 3.6. Consider $q$ :- $S(x), R(x, y), S(y)$, and assume
that $S$ is endogenous and $R$ is exogenous: in other words,
$S = S^n$, $R = R^x$. The following Datalog program computes all
causes:

    I(x)   :- S^n(x), R^x(x, x)
    C_S(x) :- S^n(x), R^x(x, y), S^n(y), ¬I(x), ¬I(y)
    C_S(y) :- S^n(x), R^x(x, y), S^n(y), ¬I(x), ¬I(y)

Here, too, we can prove that $C_S$ is non-monotone and, hence, must
use negation. Consider the database instance
$R = \{(a_4, a_3), (a_3, a_3)\}$, $S = \{a_3, a_4\}$. Then $S(a_4)$ is
not a cause; but if we remove $R(a_3, a_3)$, then $S(a_4)$ becomes a
cause.

As the previous examples show, the causality query C is, in
general, a non-monotone query: by inserting more tuples in the 
database, we determine some tuples to no longer be causes. Thus, 
negation is necessary in order to express C. The following corol- 
lary gives a sufficient condition for the causality query to simplify 
to a conjunctive query. 

Corollary 3.7. Suppose that each relation $R_i$ is either
endogenous or exogenous (that is, either $R_i^n = R_i$ or
$R_i^x = R_i$). Further, suppose that, if $R_i$ is endogenous, then the
relation symbol $R_i$ occurs at most once in the query $q$. Then, for
each relation name $R_i$, the causal query $C_{R_i}$ is a single
conjunctive query (in particular it has no negation).

The two examples above show that the corollary is tight:
Example 3.5 shows that causality is non-monotone when a relation
is mixed endogenous/exogenous, and Example 3.6 shows that
causality is non-monotone when the query has self-joins, even if all
relations are either endogenous or exogenous.

To illustrate the corollary, we revisit Example 3.3, where the
query is $q$ :- $R(x, y), S(y)$, and assume that $R^x = \emptyset$ and
$S^x = \emptyset$. Then the Datalog program becomes:

    C_R(x, y) :- R^n(x, y), S^n(y)
    C_S(y)    :- R^n(x, y), S^n(y)

4. COMPLEXITY OF RESPONSIBILITY 

In this section, we study the complexity of computing
responsibility. As before, we restrict our discussion to Boolean
queries. Thus, given a Boolean query $q$ and an endogenous tuple $t$,
compute its responsibility $\rho_t$ (Def. 2.3). We say that the query
is in PTIME if there exists a PTIME algorithm that, given a database
$D$ and a tuple $t$, computes the value $\rho_t$; we say that the query
is NP-hard, or simply hard, if the problem "given a database instance
$D$ and a number $v$, check whether $\rho_t > v$" is NP-hard. The
strongest result in this section and the paper is a dichotomy theorem
for Why-So queries without self-joins: for every query, computing the
responsibility is either in PTIME or NP-hard (Sect. 4.1). The case of
non-answers (Why-No) turns out to be a simpler problem, as Sect. 4.2
shows.

4.1 Why So? 

We assume that the conjunctive query $q$ is without self-joins,
i.e. every relation occurs at most once in $q$; we discuss self-joins
briefly at the end of the section. W.l.o.g. we further assume that
each relation is either fully endogenous or exogenous ($R_i^n = R_i$
or $R_i^x = R_i$). Recall that computing the Why-So responsibility of
a tuple $t$ requires computing the smallest contingency set $\Gamma$,
such that $t$ is a counterfactual cause in $D - \Gamma$. We start by
giving three hard queries, which play an important role in the
dichotomy result.

Theorem 4.1 (Canonical Hard Queries). Each of the
following three queries is NP-hard:

    h_1 :- A^n(x), B^n(y), C^n(z), W(x, y, z)
    h_2 :- R^n(x, y), S^n(y, z), T^n(z, x)
    h_3 :- A^n(x), B^n(y), C^n(z), R(x, y), S(y, z), T(z, x)

If the type of a relation is not specified, then the query remains hard
whether the relation is endogenous or exogenous.

We give the proof in the Appendix: we prove the hardness of $h_1$
and $h_2$ directly, and that of $h_3$ by using a particular reduction
from $h_2$. Chockler and Halpern [5] have already shown that
computing responsibility for Boolean circuits is hard, in general. One
may interpret our theorem as strengthening that result somewhat by
providing three specific queries whose responsibility is hard. However,
the theorem is much more significant. We show in this section that
every query that is hard can be proven to be hard by a simple
reduction from one of these three queries.

Next, we illustrate PTIME queries, and start with a trivial
example $q$ :- $R(a, y)$ where $a$ is a constant. If $t = R(a, b)$,
then its minimum contingency is simply the set of all tuples $R(a, c)$
with $c \neq b$, and one can compute $t$'s responsibility by simply
counting these tuples. Thus, $q$ is in PTIME. We next give a much more
subtle example.
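The counting argument for the trivial query can be written out as a formula. This is a sketch assuming that $R$ is fully endogenous and that $t = R(a, b)$ is in $D$; the name $N_t$ is introduced here only for illustration.

```latex
N_t = \{\, c \mid R(a, c) \in D,\ c \neq b \,\}, \qquad
\rho_t = \frac{1}{1 + |N_t|}
```

For instance, if $D$ contains exactly $R(a, b)$, $R(a, c_1)$ and $R(a, c_2)$, then $|N_t| = 2$ and $\rho_t = 1/3$: both other tuples must be removed before removing $t$ makes the query false.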

Example 4.2 (PTIME Query). Let $q$ :- $R(x, y), S(y, z)$, let
both $R$ and $S$ be endogenous, and w.l.o.g. let $t$ be a tuple in $R$.
We show how to compute the size of the minimal contingency set
$\Gamma$ for $t$ with a reduction to the max-flow/min-cut problem in a
network. Given the database instance $D$, construct the network
illustrated in Fig. 4. Its vertices are partitioned into
$V_1 \cup \ldots \cup V_5$. $V_1$ contains the source, which is
connected to all nodes in $V_2$. There is one edge $(x, y)$ from $V_2$
to $V_3$ for every tuple $(x, y) \in R$, and one edge $(y, z)$ from
$V_3$ to $V_4$ for every tuple $(y, z) \in S$. Finally, every node
in $V_4$ is connected to the target, in $V_5$. Set the capacity of all
edges from the source or into the target to $\infty$. The other
capacities will be described shortly. Recall that a cut in a network is
a set of edges $\Gamma$ that disconnect the source from the target. A
min-cut is a cut of minimum capacity, and the capacity of a min-cut can
be computed in PTIME using Ford-Fulkerson's algorithm. Now we make an
important observation: any min-cut $\Gamma$ in the network corresponds
to a set of tuples^2 in the database $D = R \cup S$, such that $q$ is
false on $D - \Gamma$. We use this fact to compute the responsibility
of $t$ as follows: First, set the capacity of $t$ to 0, and that of all
other tuples in $R$, $S$ to 1. Then, repeat the following procedure for
every path $p$ from the source to the target that goes through $t$: set
the capacities of all edges^3 in $p - \{t\}$ to $\infty$, compute the
size of the min-cut, and reset their capacities back to 1. In Fig. 4
there are two such paths $p$: the first is $x_1, y_2, z_1$ (the figure
shows the capacities set for this path), the other path is
$x_1, y_2, z_2$. We claim that for every min-cut $\Gamma$, the set
$\Gamma - \{t\}$ is a contingency set for $t$. Indeed, $q$ is false on
$D - \Gamma$ because the source is disconnected from the target, and
$q$ is true on $(D - \Gamma) \cup \{t\}$, because once we add $t$ back,
it will join with the other edges in $p - \{t\}$. Note that $\Gamma$
cannot include these edges as their capacity is $\infty$. Thus, by
repeating for all paths $p$ (which are at most $|S|$), we can compute
the size of the minimal contingency set as $\min |\Gamma| - 1$.

^2 In other words, the min-cut cannot include the extra edges
connected to the source or the target as they have infinite capacity.

Figure 4: Flow transformation for q :- R(x, y), S(y, z).

Figure 5: Dual query hypergraphs for (a) the easy query
q :- A(x), S_1(x, v), S_2(v, y), R(y, u), S_3(y, z), T(z, w), B(z),
and (b) the hard query h_1 :- A(x), B(y), C(z), W(x, y, z).
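The construction of Example 4.2 can be tried out with a small self-contained max-flow routine. The sketch below is an illustration under assumptions, not the paper's implementation: it uses Edmonds-Karp (the breadth-first variant of Ford-Fulkerson), it hard-codes the query q :- R(x, y), S(y, z) with both relations endogenous, it uses a large finite integer in place of infinite capacity, and the instance and function names are hypothetical. Because t has capacity 0, the max-flow value already equals the size of the cut without t, i.e. the contingency size.

```python
from collections import defaultdict, deque
from fractions import Fraction


def max_flow(cap, source, sink):
    """Edmonds-Karp on residual capacities cap[u][v] (modified in place)."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= bottleneck
            cap[w][u] += bottleneck
        flow += bottleneck


def responsibility_rs(R, S, t):
    """Why-So responsibility of t in R for q :- R(x, y), S(y, z)."""
    inf = len(R) + len(S) + 1            # exceeds any finite cut
    best = None
    for (y, z) in S:
        if y != t[1]:
            continue                      # S-tuple does not join with t
        cap = defaultdict(lambda: defaultdict(int))
        for (x, yy) in R:
            cap["src"][("x", x)] = inf
            cap[("x", x)][("y", yy)] = 0 if (x, yy) == t else 1
        for (yy, zz) in S:
            # The S-tuple on the chosen path through t must survive.
            cap[("y", yy)][("z", zz)] = inf if (yy, zz) == (y, z) else 1
            cap[("z", zz)]["snk"] = inf
        size = max_flow(cap, "src", "snk")
        best = size if best is None else min(best, size)
    return Fraction(0) if best is None else Fraction(1, 1 + best)


# Hypothetical instance.
R = {("x1", "y1"), ("x1", "y2"), ("x2", "y2")}
S = {("y1", "z1"), ("y2", "z1"), ("y2", "z2")}
print(responsibility_rs(R, S, ("x1", "y2")))
```

On this instance the smallest contingency for R(x1, y2) has two tuples (for example R(x1, y1) and R(x2, y2)), so the sketch reports a responsibility of 1/3.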

We next generalize the algorithm in Example 4.2 to the large
class of linear queries. We need two definitions first.

Definition 4.3 (Dual Query Hypergraph H^D). The dual
query hypergraph H^D(V, E) of a query q :- g1, ..., gm is a
hypergraph with vertex set V = {g1, ..., gm} and a hyperedge E_i for
each variable x_i ∈ Var(q) such that E_i = {g_j | x_i ∈ Var(g_j)}.

Note that nodes are the atoms, and edges are the variables. This
is the "dual" of the standard query hypergraph [11], where nodes
are variables and edges are atoms.

Definition 4.4 (Linear Query). A hypergraph H(V, E) is
linear if there exists a total order S_V of V, such that every
hyperedge e ∈ E is a consecutive subsequence of S_V. A query is linear
if its dual hypergraph is linear.

In other words, a query is linear if its atoms can be ordered such
that every variable appears in a contiguous sequence of atoms. For
example, the query q in Fig. 5a is linear. Order the atoms as A, S1,
S2, R, S3, T, B, and every variable appears in a contiguous
sequence, e.g. y occurs in S2, R, S3. On the other hand, none of
the queries in Theorem 4.1 is linear. For example, the dual
hypergraph of h1* is shown in Fig. 5b: one cannot "draw a line" through
the vertices and stay inside hyperedges. Note that the definition of
linearity ignores the endogenous/exogenous status of the atoms.
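
Definition 4.4 can be checked directly on small queries. The sketch below is a brute-force illustration only: it tries every ordering of the atoms, which is acceptable because the number of atoms is a property of the query and not of the database. The function name `linearization` and the dictionary encoding of a query (atom name mapped to its set of variables) are assumptions made for this example.

```python
from itertools import permutations

def linearization(query):
    """Return an atom order in which every variable occupies consecutive
    positions, or None if the dual hypergraph is not linear."""
    atoms = list(query)
    variables = set().union(*query.values())
    for order in permutations(atoms):
        ok = True
        for x in variables:
            idx = [i for i, a in enumerate(order) if x in query[a]]
            if idx[-1] - idx[0] + 1 != len(idx):   # gap: x is not contiguous
                ok = False
                break
        if ok:
            return list(order)
    return None

# The easy query of Fig. 5a and the hard query h1* of Fig. 5b.
easy = {"A": {"x"}, "S1": {"x", "v"}, "S2": {"v", "y"}, "R": {"y", "u"},
        "S3": {"y", "z"}, "T": {"z", "w"}, "B": {"z"}}
h1 = {"A": {"x"}, "B": {"y"}, "C": {"z"}, "W": {"x", "y", "z"}}
```

On these inputs `linearization(easy)` finds an order, while `linearization(h1)` returns None, since W would have to be adjacent to each of A, B and C.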

^In our example, p - {t} contains a single other edge (namely a
tuple in S). For longer queries, it may contain additional edges. For
the query R(x, y), S(y, z), T(z, u), for example, p - {t} always
contains two edges. Hence we refer to edges in p - {t} in the plural.



Algorithm 1: Calculating responsibility for linear queries

Input: q :- g1, ..., gm, D and t, Output: ρt

G = flowGraph(dualHypergraph(q), D) ;
forall source-target paths p = {e1, ..., em} ∈ G, t ∈ p do
    capacity(e_i) ← ∞, capacity(t) ← 0 ;
    Γ_j ← maxFlow(G) ;
return ρt = 1 / (1 + (min_j |Γ_j| - 1)) ;

Function: flowGraph(H, D)

L = {g1(x1), ..., gm(xm)} linearization of H ;
V = {{source}, V1', V1, V2', V2, ..., Vm', Vm, {target}} ;
forall g_i do
    V_i' ← {t_j | t_j ∈ π_{x_i'}(g_{i-1} ⋈ g_i)}, x_i' ← x_{i-1} ∩ x_i ;
    ∀u ∈ V_{i-1}, v ∈ V_i', add edge e(u, v) if u(x_i') = v(x_i') ;
    capacity(e) ← ∞ ;
    forall t_j ∈ g_i do
        V_i ← V_i ∪ {t_j} ;
        ∀u ∈ V_i', v ∈ V_i, add edge e(u, v) if u(x_i') = v(x_i') ;
        if t_j ∈ D^n then capacity(e) ← 1 else capacity(e) ← ∞ ;
∀v ∈ Vm, capacity(e(v, target)) ← ∞ ;
∀v ∈ V1', capacity(e(source, v)) ← ∞ ;



For every linear query, the responsibility of a tuple can be
computed in PTIME using Algorithm 1. The algorithm essentially
extends the construction given in Example 4.2 to arbitrary linear queries.
Note that it treats exogenous relations differently from endogenous
ones by assigning capacity ∞ to their tuples. Thus, we have:

Theorem 4.5 (Linear Queries). For any linear query q and 
any endogenous tuple t, the responsibility of t for q can be com- 
puted in PTIME in the size of the database D. 

So far, Theorem 4.1 has described some hard queries, and
Theorem 4.5 some PTIME queries. Neither class is complete, hence we
do not yet have a dichotomy. To close the gap we need to work
on both ends. We start by expanding the class of hard queries.

Definition 4.6 (Rewriting ⇝). We define the following
rewriting relation on conjunctive queries without self-joins: q
rewrites to q', in notation q ⇝ q', if q' can be obtained from q by
applying one of the following three rules:

• Delete x (q ⇝ q[∅/x]): Here, q[∅/x] denotes the query
obtained by removing the variable x ∈ Var(q), and thus
decreasing the arity of all atoms that contained x.

• Add y (q ⇝ q[(x, y)/x]): Here, q[(x, y)/x] denotes the
query obtained by adding variable y to all atoms that contain
variable x, and thus increasing their arity, provided there
exists an atom in q that contains both variables x, y.

• Delete g (q ⇝ q - {g}): Here, g denotes an atom and
q - {g} denotes the query q without the atom g, provided
that g is exogenous, or there exists some other atom g0 s.t.
Var(g0) ⊆ Var(g).

Denote ⇝* the transitive and reflexive closure of ⇝. We show
that rewriting always reduces complexity:

Lemma 4.7 (Rewriting). If q ⇝ q' and q' is NP-hard, then
q is also NP-hard. In particular, q is NP-hard if q ⇝* h_i*, where h_i*
is one of the three queries in Theorem 4.1.

Example 4.8 (Rewriting). We illustrate how one can prove
that the query q :- R(x, y), S(y, z), T(z, u), K(u, x) is hard, by
rewriting it to h2*:

q ⇝ R(x, y), S(y, z), T(x, z, u), K(u, x) (add x)

⇝ R(x, y), S(y, z), T(x, z, u), K(u, x, z) (add z)

⇝ R(x, y), S(y, z), T(x, z, u) (delete K)

⇝ R(x, y), S(y, z), T(x, z) (delete u)
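
The rewriting steps of Example 4.8 can be replayed mechanically. The following sketch encodes a query as a dictionary from atom names to variable sets and implements the three rules of Definition 4.6 with their side conditions. The function names and the `exogenous` parameter are hypothetical, and the example treats all four atoms as endogenous, which is an assumption of this illustration.

```python
def delete_var(q, x):
    """Rule 'Delete x': remove variable x from every atom."""
    return {a: vs - {x} for a, vs in q.items()}

def add_var(q, x, y):
    """Rule 'Add y': add y to every atom containing x, provided some
    atom already contains both x and y."""
    if not any({x, y} <= vs for vs in q.values()):
        raise ValueError("no atom contains both variables")
    return {a: (vs | {y} if x in vs else vs) for a, vs in q.items()}

def delete_atom(q, g, exogenous=frozenset()):
    """Rule 'Delete g': allowed if g is exogenous or some other atom g0
    satisfies Var(g0) <= Var(g)."""
    if g not in exogenous and not any(h != g and q[h] <= q[g] for h in q):
        raise ValueError("atom cannot be deleted")
    return {a: vs for a, vs in q.items() if a != g}

q = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "u"}, "K": {"u", "x"}}
q1 = add_var(q, "u", "x")      # T(z, u) becomes T(x, z, u)
q2 = add_var(q1, "u", "z")     # K(u, x) becomes K(u, x, z)
q3 = delete_atom(q2, "K")      # Var(T) is contained in Var(K)
q4 = delete_var(q3, "u")       # T(x, z, u) becomes T(x, z)
```

The final query `q4` has the atoms R(x, y), S(y, z), T(x, z), which is h2* up to the order of variables inside T.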

With rewriting we expanded the class of hard queries. Next we
expand the class of PTIME queries. As notation, we say that two
atoms g_i, g_j of a conjunctive query are neighbors if they share a
variable: Var(g_i) ∩ Var(g_j) ≠ ∅.

Definition 4.9 (Weakening ⊸). We define the following
weakening relation on conjunctive queries without self-joins: q
weakens to q', in notation q ⊸ q', if q' can be obtained from q
by applying one of the following two rules:

• Dissociation: If g_i^x is an exogenous atom and v_i a variable
occurring in some of its neighbors, then let q' be obtained by
adding v_i to the variable set of g_i^x (this increases its arity).

• Domination: If g_i^n is an endogenous atom and there exists
some other endogenous atom g_0^n s.t. Var(g_0^n) ⊆ Var(g_i^n), then
let q' be obtained by making g_i exogenous, g_i^x.

Intuitively, a minimum contingency never needs to contain
tuples from a dominated relation, and thus the relation is effectively
exogenous. Along the lines of Lemma 4.7 we show the following
for weakening:



Lemma 4.10 (Weakening). If q ⊸ q' and q' is in PTIME,
then q is also in PTIME.



Thus, weakening allows us to expand the class of PTIME queries.
We denote ⊸* the transitive and reflexive closure of ⊸. We say that
a query q is weakly linear if there exists a weakening q ⊸* q' s.t. q'
is linear. Obviously, every linear query is also weakly linear.

Corollary 4.11 (Weakly Linear Queries). If q is weakly
linear, then it is in PTIME.



Lemma 4.10 is based on the simple observation that a weakening
produces a query q' over a database instance D' that produces the
same output tuples as query q on database instance D. Weakening
only affects exogenous and dominated atoms, which are not part of
minimum contingencies, and therefore responsibility remains
unaffected. This also implies an algorithm for computing responsibility
in the case of weakly linear queries: find a weakening of q that is
linear and apply Algorithm 1.

Example 4.12. We illustrate the lemma with two examples.
First, we show that q :- R^n(x, y), S^x(y, z), T^n(z, x) is in PTIME
by weakening with a dissociation:

q ⊸ R^n(x, y), S^x(x, y, z), T^n(z, x) (dissociation)

The latter is linear. Query q should be contrasted with h2* in
Theorem 4.1: the only difference is that here S is exogenous, and this
causes q to be in PTIME while h2* is NP-hard. Second, consider
q :- R^n(x, y), S^n(y, z), T^n(z, x), V^n(x). Here we weaken with a
domination followed by a dissociation:

q ⊸ R^x(x, y), S^n(y, z), T^x(z, x), V^n(x) (domination)
⊸ R^x(x, y, z), S^n(y, z), T^x(z, x, y), V^n(x) (dissociation)

The latter is linear with the linear order S^n, R^x, T^x, V^n.
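
The two weakenings of Example 4.12 can be checked with a small script. This is an illustrative sketch under assumptions: a query is a dictionary from atom names to a pair (variable set, status), where status is "n" for endogenous and "x" for exogenous; the function names are hypothetical; and linearity is tested by brute force over atom orderings.

```python
from itertools import permutations

def is_linear(q):
    """Brute-force test of Definition 4.4 on the dual hypergraph of q."""
    variables = set().union(*(vs for vs, _ in q.values()))
    for order in permutations(q):
        def contiguous(x):
            idx = [i for i, a in enumerate(order) if x in q[a][0]]
            return idx[-1] - idx[0] + 1 == len(idx)
        if all(contiguous(x) for x in variables):
            return True
    return False

def dissociate(q, g, v):
    """Add v to exogenous atom g; v must occur in a neighbor of g."""
    vs, status = q[g]
    if status != "x" or not any(h != g and v in q[h][0] and q[h][0] & vs for h in q):
        raise ValueError("dissociation not applicable")
    return {**q, g: (vs | {v}, "x")}

def dominate(q, g):
    """Make endogenous atom g exogenous if another endogenous atom g0
    satisfies Var(g0) <= Var(g)."""
    vs, status = q[g]
    if status != "n" or not any(h != g and q[h][1] == "n" and q[h][0] <= vs for h in q):
        raise ValueError("domination not applicable")
    return {**q, g: (vs, "x")}

# First query of Example 4.12: S is exogenous.
qa = {"R": ({"x", "y"}, "n"), "S": ({"y", "z"}, "x"), "T": ({"z", "x"}, "n")}
qa_weak = dissociate(qa, "S", "x")

# Second query: V(x) dominates R and T, which are then dissociated.
qb = {"R": ({"x", "y"}, "n"), "S": ({"y", "z"}, "n"),
      "T": ({"z", "x"}, "n"), "V": ({"x"}, "n")}
qb_weak = dissociate(dissociate(dominate(dominate(qb, "R"), "T"), "R", "z"), "T", "y")
```

Neither `qa` nor `qb` is linear as given, but both weakened queries are, which is what makes the original queries weakly linear.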



We say that a query q is final if it is not weakly linear and for
every rewriting q ⇝ q', the rewritten query q' is weakly linear. For
example, each of h1*, h2*, h3* in Theorem 4.1 is final: one can check
that if we try to apply any rewriting to, say, h1* we obtain a linear
query. We can now state our main technical result:

Theorem 4.13 (Final Queries). If q is final, then q is one
of h1*, h2*, h3*.

This is by far the hardest technical result in this paper. We give 
the proof in the appendix. Here, we show how to use it to prove the 
dichotomy result. 

Corollary 4.14 (Responsibility Dichotomy). Let q be
any conjunctive query without self-joins. Then:

• If q is weakly linear then q is in PTIME.

• If q is not weakly linear then it is NP-hard.



Proof. If q is weakly linear then it is in PTIME by
Corollary 4.11. Suppose q is not weakly linear. Consider any sequence
of rewritings q = q0 ⇝ q1 ⇝ q2 ⇝ ... Any such sequence must
terminate as any rewriting results in a simpler query. We rewrite as
long as q_i is not weakly linear and stop at the last query q_k that is
not weakly linear. That means that any further rewriting q_k ⇝ q'
results in a weakly linear query q'. In other words, q_k is a final
query. By Theorem 4.13, q_k is one of h1*, h2*, h3*. Thus, we have
proven q ⇝* h_j*, for some j = 1, 2, 3. By Lemma 4.7 the query q
is NP-hard. □



Extensions. We have shown in Sect. 3 that causality can be
computed with a relational query. This raises the question: if the
responsibility of a query q is in PTIME, can we somehow compute it
in SQL? We answer this negatively:

Theorem 4.15 (Logspace). Computing the Why-So
responsibility of a tuple t ∈ D^n is hard for LOGSPACE for the following
query: q :- R^n(x, u1, y), S^x(y, u2, z), T^n(z, u3, w)

Finally, we add a brief discussion of queries with self-joins. Here
we establish the following result:

Proposition 4.16 (Self-joins). Computing the
responsibility of a tuple t for q :- R^n(x), S^x(x, y), R^n(y) is NP-hard. The
same holds if one replaces S^x with S^n.

We include the proof in the appendix. Beyond this result, however,
queries with self-joins are harder to analyze, and we do not yet have
a full dichotomy. In particular, we leave open the complexity of the
query R^n(x, y), R^n(y, z).

4.2 Why No? 

While the complexity of Why-So responsibility turned out to be
quite difficult to analyze, the Why-No responsibility is much easier.
This is because, for any query q with m subgoals and non-answer
a, any contingency set for a tuple t will have at most m - 1 tuples.
Since m is independent of the size of the database, we obtain the
following:

Theorem 4.17 (Why-No responsibility). Given a query
q over a database instance D and a non-answer a, computing the
responsibility of t ∈ D^n over D is in PTIME.
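
The bound of m - 1 tuples suggests a direct enumeration, sketched below for Boolean conjunctive queries. The sketch assumes the Why-No setting in which the exogenous tuples D^x are the tuples actually present, the endogenous tuples D^n are candidate missing tuples, and a contingency Γ ⊆ D^n is a set of insertions such that q is false on D^x ∪ Γ and true on D^x ∪ Γ ∪ {t}. The encoding of facts as (relation, tuple) pairs and the function names are hypothetical.

```python
from itertools import combinations

def holds(query, facts):
    """True iff the Boolean conjunctive query has a valuation over `facts`.
    `query` is a list of (relation, variables); `facts` a set of (relation, tuple)."""
    def extend(i, env):
        if i == len(query):
            return True
        rel, args = query[i]
        for r, tup in facts:
            if r != rel or len(tup) != len(args):
                continue
            new = dict(env)
            # setdefault binds a fresh variable or returns its existing value
            if all(new.setdefault(a, v) == v for a, v in zip(args, tup)):
                if extend(i + 1, new):
                    return True
        return False
    return extend(0, {})

def why_no_responsibility(query, d_x, d_n, t):
    """Try contingencies of size 0 .. m-1, smallest first."""
    others = [f for f in d_n if f != t]
    for k in range(len(query)):
        for gamma in combinations(others, k):
            base = set(d_x) | set(gamma)
            if not holds(query, base) and holds(query, base | {t}):
                return 1.0 / (1 + k)
    return 0.0          # t is not a cause of the non-answer
```

For q :- R(x, y), S(y) with D^x empty and D^n = {R(1, 2), S(2), S(3)}, the tuple S(2) has responsibility 1/2 (insert R(1, 2) first), while S(3) is not a cause. The number of candidate sets is polynomial in |D^n| because m is fixed by the query.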
5. RELATED WORK

Our work is mainly related to and unifies ideas from work on
causality, provenance, and query result explanations.

Causality. Causality is an active research area mainly in logic
and philosophy with its own dedicated workshops (e.g. [23]). The
idea of counterfactual causality (if X had not occurred, Y would
not have occurred) can be traced back to Hume [19], and the best
known counterfactual analysis of causation in modern times is due
to Lewis [16]. Halpern and Pearl [13] define a variation they call
actual causality which relies upon a graph structure called a causal
network, and adds the crucial concept of a permissive contingency
before determining causality. Chockler and Halpern [5] define the
degree of responsibility as a gradual way to assign causality. Our
definitions of Why-So and Why-No causality and responsibility for
conjunctive queries build upon the HP definition, but simplify it and
do not require a causal network. A general overview of causality in
a database context is given in [17], while [18] introduces functional
causality as an improved, more robust version of the HP definition.

Provenance. Approaches for defining data provenance can be
mainly classified into three categories: how-, why-, and
where-provenance [2, 4, 7, 12]. We point to the close connection between
why-provenance and Why-So causality: both definitions concern
the same tuples if all tuples in a database are endogenous^.
However, our work extends the notion of provenance by allowing users
to partition the lineage tuples into endogenous and exogenous, and
presenting a strategy for constructing a query to compute all causes^.
In addition, we can rank tuples according to their individual
responsibilities, and determine a gradual contribution with counterfactual
tuples ranked first.

Missing query results. Very recent work focuses on the
problem of explaining missing query answers, i.e., why a certain tuple
is not in the result set. The work by Huang et al. [15] and the
Artemis [14] system present provenance for potential answers by
providing tuple insertions or modifications that would yield the
missing tuples. This is equivalent to providing the set of
endogenous tuples for Why-No causality. Alternatively, Chapman and
Jagadish focus on the operator in the query plan that eliminated a
specific tuple, and Tran and Chan [22] suggest an approach to
automatically generate a modified query whose result includes both
the original query's results as well as the missing tuple.

Our definitions of Why-So and Why-No causality highlight the
symmetry between the two types of provenance ("positive and
negative provenance"). Instead of treating them separately, we show
how to construct Datalog programs that compute all
Why-So or Why-No tuple causes given a partitioning of tuples into
endogenous and exogenous. Analogously, responsibility applies to
both cases in a uniform manner.

6. CONCLUSIONS 

In this paper, we introduce causality as a framework for
explaining answers and non-answers in a database setting. We define
two kinds of causality, Why-So for actual answers, and Why-No
for non-answers, which are related to the provenance of answers
and non-answers, respectively. We demonstrate how to retrieve all
causes for an answer or non-answer using a relational query. We
give a comprehensive complexity analysis of computing causes and
their responsibilities for conjunctive queries: whereas causality is
shown to be always in PTIME, we present a dichotomy for
responsibility within queries without self-joins.

^Note that why-provenance (also called minimal witness basis)
defines a set of sets. To compare it with Why-So causality, we
consider the union of tuples across those sets.

^Note that, in general, Why-So tuples are not identical to the subset
of endogenous tuples in the why-provenance.
APPENDIX

A. NOMENCLATURE

D                       database instance
A, B, C, R, S, T, W     relations
R^n                     is a fully endogenous relation
R^x                     is a fully exogenous relation
D^n                     set of endogenous tuples: D^n ⊆ D for Why-So
D^x                     set of exogenous tuples: D^x = D - D^n
R_i^n                   set of endogenous tuples in relation R_i
Γ                       contingency: Γ ⊆ D^n
C_{R_i}                 set of causes in relation R_i
ρ_t                     responsibility of tuple t
X_t                     Boolean variable associated with tuple t
Var(q)                  variables appearing in query q
Adom(D)                 active domain
sg(x)                   set of subgoals containing variable x
Φ                       lineage
Φ^n                     endogenous lineage (n-lineage)
H^D                     dual hypergraph
⇝                       rewriting of a query
⊸                       weakening of a query
h1*, h2*, h3*           canonical hard queries of Theorem 4.1



B. PROOFS CAUSALITY 

Proof (Theorem 3.2). Assume Φ is the lineage of a in D.
We construct the endogenous lineage Φ^n = Φ(X_{D^x} = true),
and Ψ, a DNF with all the non-redundant clauses of Φ^n. We will
show that a variable X_t is a cause of a, iff X_t ∈ Ψ, which means
that X_t is part of a non-redundant clause in the endogenous lineage
of a.

Case A: (Why-So, answer a): First of all, if X_t is not in Ψ, then
X_t is not a cause of a, as there is no assignment that makes X_t
counterfactual for Ψ (and therefore Φ^n), because of monotonicity.
If X_t ∈ C_i, where C_i is a clause of Ψ, we select Γ = {X_j | X_j ∈
Ψ and X_j ∉ C_i} (Γ ⊆ D^n since ∀X_j ∈ Ψ, X_j ∈ D^n). Then,
if we write Ψ = C_i ∨ Ψ', we know that Ψ'(X_Γ = false) =
false, because Ψ contains only non-redundant terms. That means
that every clause C_j has at least one variable that is not in C_i and
therefore can be negated by the above choice of Γ. This makes
X_t counterfactual for Ψ (and also Φ^n) with contingency Γ. Since
Γ ∩ D^x = ∅, X_t is also counterfactual for Φ with contingency Γ,
meaning that Φ(X_Γ = false) is satisfiable, and Φ(X_Γ =
false, X_t = false) is unsatisfiable. Therefore, conditions 1, 2,
and 3 are equivalent.

Case B: (Why-No, non-answer a): First of all, if X_t is not in Ψ,
then X_t is not a cause of a, as there is no assignment that makes X_t
counterfactual for Ψ (and therefore Φ^n), because of monotonicity.
If X_t ∈ C_i, where C_i is a clause of Ψ, we select Γ' = {X_j | X_j ∈
C_i and X_j ≠ X_t}, and assign Γ = D^n - (Γ' ∪ {t}). Γ, Γ' ⊆ D^n
since ∀X_j ∈ Ψ, X_j ∈ D^n. Then, if we write Ψ = C_i ∨ Ψ', we
know that Ψ'(X_Γ = false) = false, because Ψ contains only
non-redundant terms. That means that every clause C_j has at least
one variable that is not in C_i, and therefore can be negated by the
above choice of Γ. This makes X_t counterfactual for Ψ (and
therefore Φ^n) with contingency Γ'. Since Γ' ∩ D^x = ∅, X_t is also
counterfactual for Φ with contingency Γ', meaning that Φ(X_Γ =
false) is unsatisfiable, and Φ(X_Γ = false, X_t = false) is
satisfiable. Therefore, conditions 1, 2, and 3 are equivalent. □
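
The characterization used in Case A lends itself to a short computation on a lineage given in DNF. The sketch below is illustrative only: it represents each clause as a set of tuple identifiers, sets the exogenous variables to true by dropping them from every clause, discards clauses that strictly contain another clause, and returns the variables that remain. The function name and the sample identifiers are assumptions of this example.

```python
def why_so_causes(lineage, exogenous):
    """Variables occurring in a non-redundant clause of the endogenous lineage.
    `lineage` is an iterable of clauses (sets of tuple ids) read as a DNF."""
    # Setting exogenous variables to true removes them from each conjunct.
    clauses = {frozenset(c) - exogenous for c in lineage}
    # A clause is redundant if some other clause is a strict subset of it.
    minimal = [c for c in clauses if not any(d < c for d in clauses)]
    return set().union(*minimal)

lineage = [{"r1", "s1"}, {"r2", "s1", "s2"}]
```

With `exogenous = {"r1"}` the first clause shrinks to {s1}, which makes the second clause redundant, so s1 is the only cause. If a clause becomes empty, the answer holds regardless of the endogenous tuples and the function returns the empty set.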

Proof (Theorem 3.4). To describe the relational query we
need a number of technical definitions. Recall that R_i^n, R_i^x denote
the endogenous/exogenous tuples in R_i. Given a Boolean
conjunctive query q :- g1(x1), ..., gm(xm) we define a refinement to
be a query of the form r :- g1^{ε1}(x1), ..., gm^{εm}(xm), where each
ε_i ∈ {n, x}. Thus, every atom is made either exogenous or
endogenous, and we call it an n- or an x-atom; there are 2^m
refinements. Clearly, q is logically equivalent to the union of the 2^m
refinements, and its lineage is equivalent to the disjunction of the
lineages of all refinements. Consider any refinement r. We call
a variable x ∈ Var(r) an n-variable if it occurs in at least one
n-atom. We apply repeatedly the following two operations: (1)
choose two n-variables x, y and substitute y := x; (2) choose any
n-variable x and any constant a occurring in the query and
substitute x := a. We call any query s that can be obtained by applying
these operations any number of times an image query; in particular,
the refinement r itself is a trivial image. There are strictly less than
2^(k^2) images, where k is the total number of n-variables and
constants in the query. Note that k is bounded by query size and thus
irrelevant to data complexity. We always minimize an image query.

Fix a refinement r. We define an n-embedding for r as a function
e : r' → s that maps a strict subset of the n-atoms in r' onto all
n-atoms of s, where s is the image of a possibly different refinement
r'. Intuitively, an n-embedding is a proof that a valuation for r
results in a redundant conjunct, because it is strictly embedded in
the conjunct of a valuation of r'.

We now describe in non-recursive, stratified Datalog the
relational query that computes all causes:

I_{s,e}(e(e^{-1}(y))) :- atoms(s)                                    (1)
C_{R_i}(x_j) :- atoms(r) ∧ ⋀_{e: r → s} ¬I_{s,e}(e^{-1}(y))          (2)

There is one IDB predicate I_{s,e}(e(e^{-1}(y))) for each possible
embedding e : r → s, and it appears in one single rule, whose
body is the same as s. The head variables are all n-variables
y in s, where each y is repeated |e^{-1}(y)| times: this is the
purpose of the e ∘ e^{-1} function. For example, if the embedding is
e : R^n(x1, x2, x3) → R^n(y, y, y), then e^{-1}(y) = (x1, x2, x3) and
e(e^{-1}(y)) = (y, y, y), hence the IDB is I(y, y, y). Next, there is
one IDB predicate C_{R_i} for every relation name R_i, and there are
one or more rules for C_{R_i}: one rule Eq. 2 for each refinement r and
each n-atom g_j(x_j) in r that refers to the relation R_i.

Therefore, the Datalog program consisting of Eq. 1 and Eq. 2
computes the set of all causes to the Boolean query q and returns
them in the IDB predicates C_{R_1}, ..., C_{R_k}. □

Proof (Corollary 3.7). The proof is immediate: there
exists a single refinement, which has no embedding. □

C. CANONICAL HARD QUERIES 



Proof (Theorem 4.1, h1* :- R(x), S(y), T(z), W(x, y, z)).
We demonstrate hardness of h1* with a reduction from the minimum
vertex cover problem in a 3-partite 3-uniform hypergraph: Given a
3-partite, 3-uniform hypergraph and constant K, determine if there
exists a vertex cover of size less than or equal to K. This problem is
shown to be hard in [21].

Take a 3-partite 3-uniform hypergraph such as the one from Fig. 6a.
The nodes can be divided into 3 partitions (R, S and T), such that
every edge contains exactly one node from each partition. We
construct 4 database relations R(x), S(y), T(z) and W(x, y, z). For
each node in the R partition of the hypergraph, we add a tuple in
R(x), and equivalently for S and T. Also, for each edge of the
hypergraph, we add a tuple in W(x, y, z). Finally, we add an
additional tuple to each relation: r0 = (x0), s0 = (y0), t0 = (z0) and
w0 = (x0, y0, z0).
Figure 6: (a) Example 3-partite 3-uniform hypergraph. (b)
Database instance created from the hypergraph of (a).



The database instance corresponding to the hypergraph of Fig. 6a
is shown in Fig. 6b. Now consider the join query

Q(x, z) :- R(x), S(y), T(z), W(x, y, z)

The responsibility of tuple r0 (equivalently s0, t0 or w0) is equal
to 1 / (1 + |S|), where S is the minimum contingency set for tuple r0.
Therefore, S contains the minimum number of tuples that make r0
counterfactual. Note that s0, t0 and w0 cannot be contained in S,
as they are the only tuples that join with r0. If S is a minimum
contingency and w_i ∈ S, then there exists S' = (S \ {w_i}) ∪ {r_j}, where
r_j.x = w_i.x, and S' is also a contingency of the same size as S and
therefore minimum. Therefore, there exists a minimum contingency
S that only contains tuples from relations R, S and T. The tuples of
R, S and T correspond to hypergraph nodes, and a contingency
corresponds to a cover: if an edge was not covered, then there would
exist a corresponding tuple that was not eliminated. Also the cover
is minimum: if there existed a smaller cover, then there would exist
a smaller contingency. Therefore, computing responsibility for h1*
is hard. □
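
The construction used in this reduction is small enough to write out. The sketch below builds the four relations from a list of hyperedges and, for tiny inputs, derives the responsibility of r0 from a brute-force minimum vertex cover, following the correspondence argued in the proof. The function names and the constants x0, y0, z0 as strings are assumptions of this illustration, and the exhaustive search is exponential, as expected for an NP-hard problem.

```python
from itertools import combinations

def h1_instance(hyperedges):
    """Relations R, S, T, W built from 3-partite hyperedges (r, s, t),
    plus the extra tuples r0, s0, t0, w0."""
    edges = set(hyperedges)
    return {
        "R": {(r,) for r, _, _ in edges} | {("x0",)},
        "S": {(s,) for _, s, _ in edges} | {("y0",)},
        "T": {(t,) for _, _, t in edges} | {("z0",)},
        "W": edges | {("x0", "y0", "z0")},
    }

def min_vertex_cover(hyperedges):
    """Size of a smallest vertex cover, by exhaustive search.
    Nodes are tagged with their partition index (0, 1 or 2)."""
    edges = list(set(hyperedges))
    nodes = sorted({(i, v) for e in edges for i, v in enumerate(e)})
    for k in range(len(nodes) + 1):
        for cover in combinations(nodes, k):
            chosen = set(cover)
            if all(any((i, v) in chosen for i, v in enumerate(e)) for e in edges):
                return k

def responsibility_of_r0(hyperedges):
    """1 / (1 + |S|), with |S| equal to the minimum vertex cover size."""
    return 1.0 / (1 + min_vertex_cover(hyperedges))
```

For the two hyperedges (a1, b1, c1) and (a1, b2, c2), node a1 covers both, so the minimum contingency for r0 has size 1 and its responsibility is 1/2.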



Proof (Theorem 4.1, h2* :- R(x, y), S(y, z), T(z, x)). We
will demonstrate hardness for computing responsibility for h2*
using a reduction from 3SAT. For the reduction we will construct a
3-colored graph G_φ. The nodes are colored with a, b, or c, and
we call them a-nodes, b-nodes, and c-nodes respectively. Any such
graph corresponds uniquely to an R, S, T instance: Every unique
a, b, and c-node maps to a unique value from the domain of
attribute x, y, and z respectively. Therefore, R contains all a → b
edges, i.e. it contains all tuples (u, v) where u is an a-node and v
is a b-node and there is an edge u → v, S contains all b → c edges,
and T all c → a edges. The important property is that h2* is true
on the instance R, S, T iff G_φ has a cycle of length 3; note that the
nodes on such a cycle necessarily have colors a, b, c. From here
on, we will use the term cycle to refer to a cycle of length 3.

Given a 3-colored graph G_φ, we define the minimum
contingency of G_φ as the smallest set of edges that contains at least
one edge from each cycle. This notion is in fact directly
equivalent to the notion of contingency Γ for tuples defined in Def. 2.3:
a set Γ is a minimum contingency for tuple R(a0, b0) in D ∪
{R(a0, b0), S(b0, c0), T(c0, a0)}, iff Γ is a subset of the edges of
G_φ, and is a minimum contingency for G_φ. We will show that
given a 3-colored graph G and a number m, checking if G has a
contingency set of size ≤ m is NP-hard.

Reduction of 3SAT to the colored graph G_φ:
Let φ = ⋀_i C_i be an instance of 3SAT, where C_i are 3-clauses
over some set X = {X1, X2, ..., Xn} of variables. For each
variable X_i we construct a graph G_i called the local ring of X_i, an
example of which is given in Fig. 7. The construction is as follows:

• Pick m_i odd and multiple of 3, such that m_i > 9|C_{X_i}|, where
|C_{X_i}| is the number of clauses the variable X_i appears in. m_i
is the length of the ring G_i.

• Create two ordered sets of nodes V^+ = {v_1^+, v_2^+, ..., v_{m_i}^+}
and V^- = {v_1^-, v_2^-, ..., v_{m_i}^-}.

• We assign the colors a, b, and c to these nodes as follows:
V^+ = {a_1^+, b_2^+, c_3^+, ...}, and V^- = {a_1^-, b_2^-, c_3^-, ...}.
In other words, every node v_{3k+1} is an a-node (denoted as a
square node in Fig. 7), every node v_{3k+2} is a b-node (denoted
as a circle), and every node v_{3k+3} is a c-node (denoted as a
triangle). Similarly for V^-.

• Create the forward edges (v_j^+, v_{j+1}^-) and (v_j^-, v_{j+1}^+) for all
j < m_i, as well as (v_{m_i}^+, v_1^-) and (v_{m_i}^-, v_1^+). Forward edges
are shown as solid in Fig. 7.

• Create the backward edges (v_j^+, v_{j-2}^+) and (v_j^-, v_{j-2}^-) for all
j > 2, as well as (v_1^+, v_{m_i-1}^+), (v_1^-, v_{m_i-1}^-), (v_2^+, v_{m_i}^+), and
(v_2^-, v_{m_i}^-). Backward edges are shown as dotted in Fig. 7.

Figure 7: A local ring of length m = 9.

The collection of the local rings G_i is the global graph G_φ. We will
prove a series of lemmas that we need to demonstrate the hardness
of computing the minimum contingency of G_φ.

Lemma C.1. A local ring G_i has a minimum contingency
comprised solely of forward edges.

Proof. This is straightforward: all cycles in G_i are comprised
of 1 backward and 2 forward edges, and each backward edge
participates in exactly one cycle. If a backward edge is part of a
contingency set, it can always be replaced by one of the forward edges on
the same cycle, resulting in a contingency set of at most the same
size. □

From here on whenever we refer to a minimum contingency in a
local ring, it will be implied that it includes forward edges only.

Lemma C.2. If m_i is odd, there are exactly 2 minimum
contingencies of size m_i that include only forward edges.

Proof. Since m_i is odd, all forward edges in a local ring form
a cyclic path going through all the nodes in V^+ ∪ V^-:

(a_1^+, b_2^-), (b_2^-, c_3^+), ..., (c_{m_i}^+, a_1^-), (a_1^-, b_2^+), ...

Every two consecutive edges belong to a cycle (with the
backward edge), hence a contingency set has to contain one of the two
edges. Thus, there are two minimum contingency sets: one
consisting of all (v^+, v^-) edges, the other consisting of all (v^-, v^+)
edges. □

We will refer to these two contingency sets as S_i^+ and S_i^-
respectively, and they are associated with assignments X_i = true and
X_i = false respectively.

We have chosen m_i large enough to ensure that each clause
containing X_i can be associated with a unique sequence of 9
consecutive pairs of nodes in G_i: the first clause in φ that contains X_i
corresponds to nodes {v_1^+, ..., v_9^+, v_1^-, ..., v_9^-} and the edges
between them; the second clause in φ that contains X_i corresponds to
nodes {v_10^+, ..., v_18^+, v_10^-, ..., v_18^-}, and so on. Let j be the starting
position of C in the local ring of X_i, and suppose that X_i is part of
the kth literal L_k in C (k = 1, 2, or 3). If L_k = X_i, then L_k
corresponds to edge (v_{j+k-1}^+, v_{j+k}^-) in S_i^+; if L_k = ¬X_i, then
L_k corresponds to edge (v_{j+k-1}^-, v_{j+k}^+). Figure 8 shows the edges
corresponding to the literals of clause C = X ∨ Y ∨ ¬Z, assuming
that j is the starting position of clause C in all three variable rings.
Since we have set aside 9 pairs of nodes for this clause, in each of
the 3 rings for the corresponding variables, we can always map the
first literal to an (a, b) edge (R tuple), the second and third literals
to a (b, c) and (c, a) edge respectively (S and T tuples).

Figure 8: Depiction of clause C = X ∨ Y ∨ ¬Z. If j is the starting
position of C's portion in the local rings of all 3 variables,
then the literals of C map to the 3 bold edges in the graph.

Figure 9: Example instance D for query h2* and corresponding
instance D' for query h3*.

Let (a1, b1), (b2, c2), and (c3, a3) be the edges corresponding to
the 3 literals of a clause C in rings G1, G2 and G3 respectively.
Then equate the following pairs of nodes: a1 = a3, b1 = b2, and
c2 = c3. In the example of Fig. 8, X.a_j^+ = Z.a_{j+3}^+,
X.b_{j+1}^- = Y.b_{j+1}^+, and Y.c_{j+2}^- = Z.c_{j+2}^-, where X_i.u denotes
node u in the ring of X_i.

Equating two nodes simply means collapsing them into a single
node, causing the local rings to intersect. Therefore, each clause
corresponds to a new cycle in G_φ comprised of three forward edges
from three variable rings. In the example of Fig. 8 the cycle caused
by C is X.a_j^+ → X.b_{j+1}^- = Y.b_{j+1}^+ → Y.c_{j+2}^- = Z.c_{j+2}^- →
Z.a_{j+3}^+ = X.a_j^+. No additional cycles can be created due to the
"buffer" nodes between clause portions in the local rings: a second
clause on X_i will map to edges that are distanced by 5 or more
nodes from the previous clause edges. Note that one buffer node is
actually sufficient and thus the local rings can be made smaller. We
chose to have portions of length 9 for each clause to simplify some
notation, as now every clause starts with an a-node, but that is not
actually necessary.

The global graph G_φ therefore contains all cycles from the local
rings G_i, plus one cycle for each clause in φ. We will now show
that φ has a satisfying assignment iff G_φ has a contingency of size
Σ_i m_i, i.e. equal to the sum of the lengths of all local rings G_i.

Lemma C.3. The formula φ has a satisfying assignment iff G_φ
has a contingency of size Σ_i m_i.

Proof. From Lemma C.2 we know that a contingency for G_i
cannot be smaller than m_i, and in fact will either be the set S_i^+, or
the set S_i^- (defined after Lemma C.2). Any contingency for G_φ has
to eliminate the cycles in G_i as well, and therefore has to contain
S_i^+ or S_i^-.

We will first show the forward direction. Assume that φ has a
satisfying assignment A. Construct set S by selecting S_i^+ or S_i^-
for each variable X_i as follows: if X_i = true in A, then add the
edges from S_i^+ to S, otherwise add the edges from S_i^-. Then S
has size Σ_i m_i. Consider a clause C, and the cycle of forward edges
due to its literals. At least one literal L of C evaluates to true
under assignment A, since A is a satisfying assignment for φ. If
L = X_i, then L maps to some edge e = (v^+, v^-) ∈ S_i^+, and
since X_i = true in A, e ∈ S. Similarly, if L = ¬X_i, then L
maps to some edge e = (v^-, v^+) ∈ S_i^-, but again e ∈ S since X_i
evaluates to false. Therefore, every cycle due to clauses of φ has
an edge in S, and so S is a contingency for G_φ.

We will now show the reverse. Let S be a contingency in G_φ,
such that |S| = Σ_i m_i. A contingency in G_φ also defines a
contingency in all G_i of respective sizes m_i; S contains either S_i^+ or
S_i^-, which map to X_i = true and X_i = false respectively. Let
A be the assignment based on these values for each X_i, and let C
be a clause of φ. Since S is a contingency in G_φ, at least one
edge corresponding to a literal of C is contained in S. Let that
literal be L over variable X_i. If L = X_i, then the edge contained in
S is e = (v^+, v^-), and thus S_i^+ ⊆ S. This means that X_i = true
in A, so C is satisfied. Similarly, if L = ¬X_i, then the edge
contained in S is e = (v^-, v^+), and thus S_i^- ⊆ S. This means that
X_i = false in A, and again C is satisfied. Therefore, assignment
A satisfies all clauses, meaning that φ is satisfiable. □



Proof (Theorem 4.1, $h_3^*$). We prove hardness of $h_3^*$ by a simple reduction from $h_2^*$. We start by writing the two queries as

$h_2^* \text{ :- } R(x,y), S(y,z), T(z,x)$

$h_3^* \text{ :- } R'(x',y'), S'(y',z'), T'(z',x'), A'(x'), B'(y'), C'(z')$

Now we transform a database instance for $h_2^*$ into one for $h_3^*$ as follows: For every tuple $r_i$ in $R(x,y)$, insert $r_i$ as a new tuple into $A'(x')$. Repeat analogously for each $s_i$, $t_i$ from $S(y,z)$, $T(z,x)$ and $B'(y')$, $C'(z')$. Then, for each valuation $\theta = [a/x, b/y, c/z]$ that makes $h_2^*$ true over $D$, insert $(r_i, s_i)$, $(s_i, t_i)$, and $(t_i, r_i)$ into $R'$, $S'$, and $T'$, respectively, where $(r_i, s_i, t_i)$ represent the tuples in the original $R$, $S$, and $T$ corresponding to $\theta$. Now we have a one-to-one correspondence between the tuples in $R$ and $A'$, $S$ and $B'$, and $T$ and $C'$. Comparing the lineages for these two queries, one sees that $R'$, $S'$, and $T'$ are dominated by $A'$, $B'$, and $C'$, and that the minimal lineages are identical. Hence causes and their responsibility are identical. □
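The instance transformation used in this reduction can be sketched in Python. This is an illustrative sketch rather than code from the paper: the function names and the encoding of relations as sets of tuples (with each original tuple reused as a single value in the new instance) are assumptions made here.

```python
def h2_to_h3_instance(R, S, T):
    """Build an instance (A', B', C', R', S', T') from relations R(x,y), S(y,z), T(z,x).

    R, S, T are sets of pairs. Each original tuple becomes one value
    in the transformed instance.
    """
    # One unary tuple per original tuple: A' mirrors R, B' mirrors S, C' mirrors T.
    A = {(r,) for r in R}
    B = {(s,) for s in S}
    C = {(t,) for t in T}
    Rp, Sp, Tp = set(), set(), set()
    # Enumerate every valuation [a/x, b/y, c/z] that satisfies R, S and T together.
    for (a, b) in R:
        for (b2, c) in S:
            if b2 == b and (c, a) in T:
                r, s, t = (a, b), (b, c), (c, a)
                Rp.add((r, s))
                Sp.add((s, t))
                Tp.add((t, r))
    return A, B, C, Rp, Sp, Tp


def h3_answers(A, B, C, Rp, Sp, Tp):
    """Valuations (x', y', z') that satisfy all six atoms of the target query."""
    return {
        (x, y, z)
        for (x, y) in Rp
        for (y2, z) in Sp
        if y2 == y and (z, x) in Tp and (x,) in A and (y,) in B and (z,) in C
    }
```

Each satisfying valuation of the source query yields exactly one satisfying valuation of the target query, which is what makes the minimal lineages coincide.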



D. RESPONSIBILITY DICHOTOMY

Proof (Theorem 4.5). It follows directly from Algorithm 1. The flow graph constructed has one edge per database tuple. The capacities of exogenous tuples are $\infty$ and all other tuples have capacity one. Every unit of $s$-$t$ flow corresponds to an output tuple of $q$. It is impossible that a flow does not correspond to a valid output tuple, as the partitions are ordered based on the linearization. This means that if a variable is chosen it will not occur again later in the flow, and therefore we cannot have invalid flows going through one value of $x$ in one partition and another in a later one. The steps of the algorithm (construction of the hypergraph, linearization, flow transformation, and computation of the maximum flow) are all in PTIME, and therefore responsibility of linear queries can be computed in PTIME. □
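A minimal sketch of the flow construction described in this proof, for the hypothetical linear query q :- R(x, y), S(y, z). It is not Algorithm 1 itself: it only builds the flow graph (one edge per tuple, capacity 1 for endogenous tuples and infinite capacity for exogenous ones) and returns the max-flow value, which by max-flow/min-cut equals the minimum number of endogenous tuples whose removal eliminates every answer. All function names and the tuple encoding are assumptions made for illustration.

```python
from collections import defaultdict, deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp on a residual-capacity map cap[u][v]; mutates cap."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Bottleneck capacity along the augmenting path found by BFS.
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        if bottleneck == INF:
            return INF  # some answer uses only exogenous tuples
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def min_tuples_to_remove(R, S, exogenous=frozenset()):
    """Min number of endogenous tuples whose removal falsifies q :- R(x,y), S(y,z)."""
    cap = defaultdict(lambda: defaultdict(int))
    for (a, b) in R:
        cap["s"][("x", a)] = INF
        cap[("x", a)][("y", b)] = INF if ("R", a, b) in exogenous else 1
    for (b, c) in S:
        cap[("y", b)][("z", c)] = INF if ("S", b, c) in exogenous else 1
        cap[("z", c)]["t"] = INF
    return max_flow(cap, "s", "t")
```

Because the partitions follow the variable order x, y, z, every source-to-target path picks exactly one value per variable, so each path corresponds to one answer of the query.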

Proof (Lemma). Case 1: $q'$ resulted from variable deletion ($q \leadsto q[\emptyset/x]$). Then we can polynomially reduce $q'$ to $q$ by setting variable $x$ to the constant $a$. Therefore, if $q$ is in PTIME, $q'$ is also in PTIME.

Case 2: $q'$ resulted from the rewrite $q \leadsto q[(x,y)/x]$. Assume $q \text{ :- } g_1(x, \ldots), g_2(x, y, \ldots), q_0$; then $q' \text{ :- } g_1'(x, y, \ldots), g_2(x, y, \ldots), q_0$. Reduce $q'$ to $q$ as follows: for each subgoal $g_1' \in q'$, create a unique value $(x_i, y_j)$ for each tuple $g_1'(x_i, y_j)$, and assign these as the tuples of $g_1(x)$ in $q$. Similarly, create a unique value $(x_i, y_j)$ for each tuple $g_2(x_i, y_j)$, and assign each as a tuple $g_2((x_i, y_j), y_j)$ to form relation $g_2$. Both queries have the same output and the same contingencies. Therefore, if $q$ is easy, $q'$ also has to be easy.

Case 3: $q'$ resulted from atom deletion ($q \leadsto q - \{g\}$). Assume $q \text{ :- } g_1, \ldots, g_m$ and $q' \text{ :- } g_1, \ldots, g_{j-1}, g_{j+1}, \ldots, g_m$. $q'$ differs from $q$ in that it misses atom $g_j(y)$. The atom can be deleted with a rewrite only if it is exogenous, or $\exists g_i(x)$ with $x \subseteq y$.

We will reduce responsibility for $q'$ to $q$: take the output tuples of $q'$ and project on $y$. Assign the result to $g_j$. If $g_j$ is exogenous or if $x \subset y$, then tuples from $g_j$ are never picked in the minimum contingency. If $x = y$, no minimum contingency will have the same tuple from $g_i$ and $g_j$. Therefore the contingency tuples from the 2 subgoals can be mapped to one of them, creating a contingency set for $q'$. Therefore, if $q$ is in PTIME, $q'$ is also in PTIME. □
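The three rewrites treated in this proof can be illustrated on a purely syntactic encoding of a query. The sketch below is an assumption-laden illustration, not the paper's formalism: a query is a dict from atom names to frozensets of variables, the function names are hypothetical, and the precondition on `add_variable` (some atom already contains both variables) is read off the form of $q$ in Case 2.

```python
def delete_variable(q, x):
    """Variable deletion: drop variable x from every atom."""
    return {g: vs - {x} for g, vs in q.items()}


def add_variable(q, x, y):
    """Rewrite q[(x, y)/x]: add y to every atom that contains x."""
    if not any({x, y} <= vs for vs in q.values()):
        raise ValueError("no atom contains both variables")
    return {g: (vs | {y} if x in vs else vs) for g, vs in q.items()}


def delete_atom(q, g, exogenous=frozenset()):
    """Atom deletion: allowed if g is exogenous, or if some other atom's
    variables are a subset of the variables of g."""
    allowed = g in exogenous or any(
        vs <= q[g] for h, vs in q.items() if h != g
    )
    if not allowed:
        raise ValueError("atom cannot be deleted by a rewrite")
    return {h: vs for h, vs in q.items() if h != g}
```

For example, starting from atoms A(x), B(y,v), C(z,u), R(x,y), S(v,z), T(u,x), a sequence of these operations ends in the three binary atoms A(x,y), B(y,z), C(z,x).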

Proof (Lemma 4.10). A weakening $q \leadsto \hat{q}$ also results in a new database instance $\hat{D}$. A weakening does not create any new result tuples, and does not alter the number of tuples of any endogenous non-dominated atoms. Therefore contingencies in $\hat{q}, \hat{D}$ only contain tuples from non-weakened atoms, making any contingency for $q$ also a contingency for $\hat{q}$ and vice versa. If any weakening results in a linear query, Algorithm 1 can be applied to solve it in PTIME, meaning that weakly linear queries are in PTIME. □

Lemma D.1 (Containment). If $q$ is final, then $\forall x, y$: $sg(x) \not\subseteq sg(y)$, where $sg(x)$ is the set of subgoals of $q$ that contain $x$.

Proof. We will prove by contradiction. Assume that $sg(x) \subseteq sg(y)$. Denote as $q_1$ the rewrite $q \leadsto q[\emptyset/x]$, and $q_2$ the rewrite $q \leadsto q[\emptyset/y]$. Since $q$ is final, both $q_1$ and $q_2$ are weakly linear. That means that there are weakenings, denote them with $W_1$ and $W_2$, of $q_1$ and $q_2$ respectively, that produce linear queries $\hat{q}_1$ and $\hat{q}_2$. The weakenings $W_1$ and $W_2$ are sets of dissociation and domination operations on queries $q_1$ and $q_2$. $\hat{q}_1$ and $\hat{q}_2$ determine linear orderings $L_1$ and $L_2$ of the subgoals of $q_1$ and $q_2$ respectively. All subgoals from $\hat{q}_1$ containing some variable $z$ appear consecutively in $L_1$, but that may not be true for $q_1$. Similarly for $q_2$ and $L_2$.
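The property used here (all subgoals containing a given variable appear consecutively in the ordering) is easy to test mechanically. The following is a small sketch under the assumption that an ordering is given as a list of variable sets, one per subgoal; the function name is hypothetical.

```python
def is_linear_order(order):
    """Return True if, for every variable, the subgoals containing it
    occupy consecutive positions in `order` (a list of variable sets)."""
    variables = set().union(*order) if order else set()
    for z in variables:
        positions = [i for i, vs in enumerate(order) if z in vs]
        # Consecutive positions span exactly len(positions) slots.
        if positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True
```

A chain such as A(x), R(x,y), S(y,z), C(z) passes the check in that order, while no ordering of the triangle R(x,y), S(y,z), T(z,x) does.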

Assume a third query $q_3$ as the rewrite $q_2 \leadsto q_2[\emptyset/x]$, or equivalently $q_1 \leadsto q_1[\emptyset/y]$. Note that any relation that is dominated in $q_1$ or $q_2$ is also dominated in $q_3$. Define the connected components of $q_3$ as the set $\mathcal{C}_3 = \{C_1, C_2, \ldots, C_k\}$ of maximum cardinality such that $\forall i \neq j$, $C_i$ and $C_j$ have no variables in common. We denote with $q_3[W_2]$ the application to $q_3$ of all weakening operations defined in $W_2$ that are valid for $q_3$. For example, a dissociation that uses variable $x$ is not valid for $q_3$, as $x \notin q_3$. Similarly, $g_i[W_2]$ denotes the subgoal $g_i$ after the application of all weakening operations defined in $W_2$ to query $q_2$.

Assume that $q_2$ and $q_3$ have the same connected components, and $\hat{q}_3 := q_3[W_2]$. Then $\hat{q}_3$ is equivalent to the rewrite $\hat{q}_2 \leadsto \hat{q}_2[\emptyset/x]$, and therefore also linear.

Even if the connected components are not the same, two components do not share the same variables. Therefore, a dissociation operation that relied on a neighborhood that disappeared with the removal of $x$ would have only served to transfer a variable from one component to another. This obviously does not affect the linearity of the result in the case of $\hat{q}_3$. This means that $q_3$ is weakly linear, and $\hat{q}_3 := q_3[W_2]$ offers a linearization.

We will now use the connected components $\mathcal{C}_3$ of $q_3$ to define linear orders for $q_2$ and $q_1$.

We map each $C_i \in \mathcal{C}_3$ to the linear order $L_2$. $C_i$ may appear fragmented (not contiguous) in $L_2$. We call a fragment of a component $C_i$ in a sequence $L$ a maximal-length subsequence of $L$ such that every subgoal in the subsequence appears in $C_i$. An example is shown in Fig. 10.



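Figure 10: $q_2 = q \leadsto q[\emptyset/y]$ and $q_3 = q_2 \leadsto q_2[\emptyset/x]$. (Figure labels: fragments $C_{2.1}$, $C_{2.2}$, $C_{2.3}$; components $C_1$, $C_2$, $C_3$.)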
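The notion of a fragment (a maximal run of consecutive subgoals of a sequence that all belong to one component) can be computed directly. This is a small sketch with hypothetical names, assuming subgoals are hashable labels and a component is a set of such labels.

```python
from itertools import groupby


def fragments(sequence, component):
    """Maximal runs of consecutive subgoals of `sequence` that belong to `component`."""
    return [
        list(run)
        for inside, run in groupby(sequence, key=lambda g: g in component)
        if inside
    ]
```

A component is contiguous in the sequence exactly when this function returns a single fragment.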

Assume a component $C_i$ which appears fragmented in $L_2$, and $C_{i.1}$ and $C_{i.2}$ two of its fragments. Let $S$ be the set of subgoals separating $C_{i.1}$ and $C_{i.2}$. All subgoals in $S$ belong to a component other than $C_i$, and therefore share no variables with $C_{i.1}$ or $C_{i.2}$. Assume that there is a subgoal $g_j \in S$ that is not modified by the weakening $W_2$, and therefore contains no variables from $C_i$. If that were the case, then it would be impossible for all variables of $C_i$ to appear consecutively in $L_2$, which is impossible as $L_2$ is a linear order for $\hat{q}_2$. Therefore, all subgoals of $S$ are modified by $W_2$, and are therefore exogenous (or dominated, and therefore made exogenous). Therefore, any connected component of $q_3$ either appears contiguous in $L_2$, or any subgoals separating 2 fragments are exogenous.

Denote with $L_x$ the subsequence of $L_2$ containing the "span" of variable $x$: $\forall g_i \in L_x$, $x \in g_i[W_2]$, and $\forall g_i \notin L_x$, $x \notin g_i[W_2]$. Let $W_x$ be the set of dissociations that propagate $x$ to all the exogenous subgoals adjacent to $L_x$. Let $W_2' = W_2 \cup W_x$. Then $\hat{q}_2' = q_2[W_2']$ is also a linear query, and $L_2$ is a linear order for it.

Since $\hat{q}_2'$ has $x$ propagated to all adjacent exogenous subgoals, there can only be up to 2 components of $\mathcal{C}_3$ that only partially contain $x$ in $\hat{q}_2'$. For the rest of the components, either all of their subgoals contain $x$ in $\hat{q}_2'$ or none of them does. Partition $\mathcal{C}_3$ as follows:

$\mathcal{C}_x^+ = \{C_i \mid \forall g \in C_i,\ x \in g[W_2']\}$

$\mathcal{C}_x^- = \{C_i \mid \forall g \in C_i,\ x \notin g[W_2']\}$

$\mathcal{C}_x^* = \{C_i \mid \exists g_r, g_t \in C_i \text{ s.t. } x \in g_r[W_2'] \text{ and } x \notin g_t[W_2']\}$

As noted, $|\mathcal{C}_x^*| \le 2$. Mapping $\mathcal{C}_3$ to $L_1$ using the same logic produces sets $\mathcal{C}_y^+$, $\mathcal{C}_y^-$ and $\mathcal{C}_y^*$ for variable $y$ and weakening $W_1'$, which is defined similarly to $W_2'$. Since $sg(x) \subseteq sg(y)$, it is $\mathcal{C}_x^+ \subseteq \mathcal{C}_y^+$ and $\mathcal{C}_x^* \subseteq \mathcal{C}_y^+ \cup \mathcal{C}_y^*$.

Denote with $C_i[W_j]$ the application of weakening $W_j$ to all subgoals of component $C_i$. Then $\forall C_i \in \mathcal{C}_3$, $C_i[W_1']$ and $C_i[W_2']$ are both linear. Therefore, $q_3[W_1']$ and $q_3[W_2']$ are also linear. Moreover, the components of $\mathcal{C}_3$ can be freely rearranged in the linear order, and linearity is always preserved.

Let Cy"" = C + - (C+ UC*),Cl and Cy the two components 
in Cy , and and the two components in C* . 

First, assume $\mathcal{C}_x^* \cap \mathcal{C}_y^* = \emptyset$, and apply $W'$ to $q$. Then it is straightforward that the following is a linear order with respect to all the variables, including $x$ and $y$:



C'y[Wi] 










c;[wi] 


Cy[wi] 



We now examine the case $\mathcal{C}_x^* \cap \mathcal{C}_y^* \neq \emptyset$. Assume $C_i$ a connected component of $\mathcal{C}_3$ such that $C_i \in \mathcal{C}_x^*$ and $C_i \in \mathcal{C}_y^*$. $C_i[W_1]$ is in linear order for all variables except possibly for $x$, and $C_i[W_2]$ is in linear order for all variables except possibly for $y$. Let $W_x$ be those weakenings in $W_2$ that dissociate atoms by adding $x$ to their variable set. Since $sg(x) \subseteq sg(y)$, it is possible to also add $y$ to the same atoms. Let $W_{x \to y}$ be the same weakenings as in $W_x$, but with variable $y$ instead of $x$. Then in $q_{C_i} := C_i[W_2 \cup W_{x \to y}]$ it also holds that $sg(x) \subseteq sg(y)$.

Let $W_y$ be those weakenings in $W_1$ that dissociate atoms by adding $y$ to their variable set. Then $W_y$ can still be applied to $q_{C_i}$, as atoms that were neighbors in $q_1$ are also neighbors in $q_{C_i}$. Therefore, $q_{C_i}[W_y]$ is linear for all variables including $y$. Thus, the same order of components as the one above, but with $q_{C_i}[W_y]$ replacing $C_y^1[W_1]$ and $C_x^1[W_2]$ (equivalently for $C_y^2[W_1]$ and $C_x^2[W_2]$), will be linear in all variables including $x$ and $y$. But that would mean that $q$ is weakly linear, as there is a weakening for it that is linear, which is a contradiction. Therefore, $sg(x) \not\subseteq sg(y)$. □

Lemma D.2. If $q$ is final, then it has exactly 3 variables.

Proof. First of all, if $q$ has 2 or fewer variables, then it is linear, and therefore cannot be final. So, $q$ must have at least 3 variables. We will assume that $|Var(q)| > 3$ and prove the lemma by contradiction. We will reach a contradiction by applying rewrites to $q$ and showing hardness of the resulting query, which means that $q$ cannot be final. There are two possible cases: either all atoms are unary or binary, or there exists an atom with 3 or more variables. We examine the cases separately.

Case 1. $\forall g_i \in q$, $|Var(g_i)| \le 2$.
Since $q$ is final, it cannot be linear. For non-linearity, $q$ needs to have at least 3 non-dominated atoms. Also, since every atom has at most 2 variables, $q$ has to be either cyclic, or have a "corner point" in its dual hypergraph like the one shown in Fig. 11. A corner point in the dual hypergraph is defined as a hyperedge which intersects, but does not contain, at least three other hyperedges. We will refer to these as branches of the corner point.

Case A: (The dual hypergraph of $q$ has a corner point)
The three non-dominated atoms have to be on separate branches of the corner point. Assume variable $w$ is the corner point, with the following three corner point branches that together form query $q$:

$\ldots A(x, u), R_1(x, w_1), R_2(w_1, w_2), \ldots, R_i(w_{i-1}, w)$
$\ldots B(y, u'), S_1(y, w_1'), S_2(w_1', w_2'), \ldots, S_j(w_{j-1}', w)$
$\ldots C(z, u''), T_1(z, w_1''), T_2(w_1'', w_2''), \ldots, T_k(w_{k-1}'', w)$

Figure 11: Variable $w$ is a corner point for a query where $|Var(g_i)| \le 2$.

Note that there has to be at least one $R$, one $S$ and one $T$ tuple in order for $w$ to be a corner point. W.l.o.g. assume that $A$, $B$, $C$ are non-dominated atoms. We can ignore the $u$ variables if the non-dominated atoms are unary. We apply the following rewrites to $q$, $\forall t$:

$q \leadsto q[(w_{t-1}, w_t)/w_t]$,   $q \leadsto q[(w_t, w_{t-1})/w_{t-1}]$   (3)
$q \leadsto q[(w_{t-1}', w_t')/w_t']$,   $q \leadsto q[(w_t', w_{t-1}')/w_{t-1}']$   (4)
$q \leadsto q[(w_{t-1}'', w_t'')/w_t'']$,   $q \leadsto q[(w_t'', w_{t-1}'')/w_{t-1}'']$   (5)



Now all the $R$ tuples contain variables $w_1, \ldots, w_{i-1}$, and equivalently for the $S$ and $T$ atoms. We also transfer variables $x$ and $w$ with the following rewrites:

$q \leadsto q[(x, w_1)/w_1]$,   $q \leadsto q[(w, w_1)/w_1]$
$q \leadsto q[(y, w_1')/w_1']$,   $q \leadsto q[(w, w_1')/w_1']$
$q \leadsto q[(z, w_1'')/w_1'']$,   $q \leadsto q[(w, w_1'')/w_1'']$

The rewrites transform all of $R_1, R_2, \ldots, R_i$ to the same atom $R(x, w_1, w_2, \ldots, w_{i-1}, w)$. Equivalently for the $S$ and $T$ atoms. Finally, remove all variables $u$ other than $\{x, y, z, w\}$ by applying the rewrite $q \leadsto q[\emptyset/u]$. Therefore, after the rewrites the query becomes:

$q' \text{ :- } A(x), B(y), C(z), R(x, w), S(y, w), T(z, w)$

We can reduce $h_1^*$ to $q'$ as follows: Atoms $A$, $B$ and $C$ remain unchanged. For each tuple $W(x, y, z)$ we assign a unique value $w$: $(x, y, z, w)$. We get relation $R$ by projecting on $(x, w)$, and similarly for $S$ and $T$. Responsibility of a tuple in $h_1^*$ is the same as the responsibility of the tuple in $q'$. Therefore $q'$ is hard, which means that $q$ cannot be final.
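The last step, which splits the ternary relation $W$ into three binary relations sharing a fresh value $w$, can be sketched as follows. The function names are hypothetical, and using consecutive integers as the unique $w$ values is an assumption made only for illustration.

```python
def split_ternary(W):
    """Split W(x, y, z) into R(x, w), S(y, w), T(z, w) with a unique w per tuple."""
    R, S, T = set(), set(), set()
    for w, (x, y, z) in enumerate(sorted(W)):
        R.add((x, w))
        S.add((y, w))
        T.add((z, w))
    return R, S, T


def join_on_w(R, S, T):
    """Recombine the three binary relations on their shared second attribute."""
    return {
        (x, y, z)
        for (x, w) in R
        for (y, w2) in S
        if w2 == w
        for (z, w3) in T
        if w3 == w
    }
```

Joining the three projections on $w$ recovers exactly the original tuples of $W$, so the two instances have the same answers.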

Case B: (The dual hypergraph of $q$ is cyclic)
We will distinguish between 4 possibilities for the non-dominated relations: (a) all of them are unary, (b) exactly one is binary, (c) exactly two are binary, and (d) all of them are binary.

(a) Let $A$, $B$ and $C$ be the non-dominated atoms. Then $q$ is of the form:

$A(x), R_1(x, w_1), R_2(w_1, w_2), \ldots, R_i(w_{i-1}, y)$
$B(y), S_1(y, w_1'), S_2(w_1', w_2'), \ldots, S_j(w_{j-1}', z)$
$C(z), T_1(z, w_1''), T_2(w_1'', w_2''), \ldots, T_k(w_{k-1}'', x)$

We apply the rewrites (3), (4) and (5) from case A. Since we know that $|Var(q)| > 3$, we are guaranteed that at least one rewrite will happen, as there must exist at least one $w$ variable. The rewrites transform all of $R_1, \ldots, R_i$ to the same atom $R(x, w_1, \ldots, w_{i-1}, y)$. Equivalently for the $S$ and $T$ atoms. Name the resulting query $q'$. Query $h_3^*$ can be trivially reduced to $q'$ by setting all the $w$ variables to a constant. Therefore $q'$ is hard, which means that $q$ cannot be final.

(b) Let $A$, $B$ and $C$ be the non-dominated atoms, and $q$ is of the form:

$A(x), R_1(x, w_1), R_2(w_1, w_2), \ldots, R_i(w_{i-1}, y)$
$B(y), S_1(y, w_1'), S_2(w_1', w_2'), \ldots, S_j(w_{j-1}', z)$
$C(z, u), T_1(u, w_1''), T_2(w_1'', w_2''), \ldots, T_k(w_{k-1}'', x)$

We apply the rewrites (3), (4) and (5) from case A. The query then becomes:

$A(x), R(x, w_1, w_2, \ldots, w_{i-1}, y)$
$B(y), S(y, w_1', w_2', \ldots, w_{j-1}', z)$
$C(z, u), T(u, w_1'', w_2'', \ldots, w_{k-1}'', x)$

It is possible that there are no $w$ variables, and no rewrites actually occurred. Apply the rewrite $q \leadsto q[(u, z)/u]$. This is guaranteed to occur, due to the given existence of a fourth variable $u$, since $|Var(q)| > 3$. Further rewrite off all $w$ variables. After these we get query $q' \text{ :- } A(x), B(y), C(z, u), R(x, y), S(y, z), T(z, u, x)$. If we now apply the rewrite $q' \leadsto q'[\emptyset/u]$ we get $h_3^*$, which is hard. Therefore, $q$ cannot be final.

(c) Let $A$, $B$ and $C$ be the non-dominated atoms, and $q$ is of the form:

$A(x), R_1(x, w_1), R_2(w_1, w_2), \ldots, R_i(w_{i-1}, y)$
$B(y, v), S_1(v, w_1'), S_2(w_1', w_2'), \ldots, S_j(w_{j-1}', z)$
$C(z, u), T_1(u, w_1''), T_2(w_1'', w_2''), \ldots, T_k(w_{k-1}'', x)$

It is possible that $v = z$, but $x$ cannot be the same as any of $\{y, v, z, u\}$, as in that case $A$ would dominate $B$ or $C$. We apply the rewrites (3), (4) and (5) from case A, and also rewrite off all variables $w$, e.g. $q \leadsto q[\emptyset/w_1]$. The query then becomes: $q' \text{ :- } A(x), B(y, v), C(z, u), R(x, y), S(v, z), T(u, x)$. Further apply the following rewrites:

$q' \leadsto A(x), B(y, v, z), C(z, u), R(x, y), S(v, z), T(u, x)$ (add $z$)
$\leadsto A(x), B(y, v, z), C(z, u), R(x, y), S(y, v, z), T(u, x)$ (add $y$)
$\leadsto A(x), B(y, z), C(z, u), R(x, y), T(u, x)$ (delete $S$ and $v$)
$\leadsto A(x), B(y, z), C(z, u), R(x, y), T(z, u, x)$ (add $z$)
$\leadsto A(x, y), B(y, z), C(z, u), R(x, y), T(z, u, x, y)$ (add $y$)
$\leadsto A(x, y), B(y, z), C(z, u), T(z, u, x, y)$ (delete $R$)
$\leadsto A(x, y), B(y, z), C(z, u, x), T(z, u, x, y)$ (add $x$)
$\leadsto A(x, y), B(y, z), C(z, x)$ (delete $T$ and $u$)

Note that atom $S$ has been eliminated, so even if $v = z$, in which case there would be no $S$, it would not affect the result; the first 3 of the above rewrites would simply be unnecessary. The last query is $h_2^*$, which is hard, which means that $q$ cannot be final.

(d) Let $A$, $B$ and $C$ be the non-dominated atoms, and $q$ is of the form:

$A(x, t), R_1(t, w_1), R_2(w_1, w_2), \ldots, R_i(w_{i-1}, y)$
$B(y, v), S_1(v, w_1'), S_2(w_1', w_2'), \ldots, S_j(w_{j-1}', z)$
$C(z, u), T_1(u, w_1''), T_2(w_1'', w_2''), \ldots, T_k(w_{k-1}'', x)$

It is possible that $t = y$ or $v = z$ (in which cases there would be no $R$ or $S$ atoms respectively). Again we apply the rewrites (3), (4) and (5) from case A, and also rewrite off all variables $w$, e.g. $q \leadsto q[\emptyset/w_1]$. The query then becomes:

$q' \text{ :- } A(x, t), B(y, v), C(z, u), R(t, y), S(v, z), T(u, x)$

To account for the case $t = y$ or $v = z$, we will eliminate the $R$ and $S$ atoms with rewrites, so that the effect of the possible variable equivalence is eliminated:

$q' \leadsto A(x, t), B(y, v, z), C(z, u), R(t, y), S(v, z), T(u, x)$ (add $z$)
$\leadsto A(x, t), B(y, v, z), C(z, u), R(t, y), S(y, v, z), T(u, x)$ (add $y$)
$\leadsto A(x, t), B(y, z), C(z, u), R(t, y), T(u, x)$ (delete $S$ and $v$)
$\leadsto A(x, t, y), B(y, z), C(z, u), R(t, y), T(u, x)$ (add $y$)
$\leadsto A(x, t, y), B(y, z), C(z, u), R(x, t, y), T(u, x)$ (add $x$)
$\leadsto A(x, y), B(y, z), C(z, u), T(u, x)$ (delete $R$ and $t$)

With further rewrites q ^ q[{u, and q ^ q[{x, y)/x] we get 
q' :— A{x, y), B{y, z), C{z, u),T{z, x, y), which as seen in (c) 
previously leads to $h_2^*$ with further rewrites. Therefore, q cannot be
final. 

Case 2. $\exists g_i \in q$, $|Var(g_i)| \ge 3$.
Let $R(x, y, z, \ldots)$ be an atom of $q$. Let $S_x$, $S_y$ and $S_z$ be the subsets of subgoals of $q$ that contain variables $x$, $y$ and $z$ respectively. From Lemma D.1 we know that $S_x$, $S_y$ and $S_z$ cannot be subsets of one another. We also know that they intersect due to relation $R$. That means that there are 3 more atoms $A$, $B$ and $C$ in $q$ that contain $x$, $y$, $z$. There are 2 possible choices for $A$, $B$ and $C$ that satisfy the containment requirement.




Figure 12: Subgoal $R$ contains all 3 variables. There are only 2 ways to choose $A$, $B$ and $C$ so that $S_x$, $S_y$ and $S_z$ are not subsets of one another.

(a) $A(x, y, \ldots)$, $B(y, z, \ldots)$, $C(z, x, \ldots)$, with $z \notin A$, $x \notin B$ and $y \notin C$.
For all variables $u$ other than $\{x, y, z\}$ we rewrite: $q \leadsto q[\emptyset/u]$. Since $|Var(q)| > 3$, we know that there is at least one such variable $u$. By further rewriting $q \leadsto q - \{R\}$, we get $h_2^*$, which is hard, and which means that $q$ is not final.

(b) $A(x, \ldots)$, $B(y, \ldots)$, $C(z, \ldots)$, with $y, z \notin A$; $x, z \notin B$; and $x, y \notin C$.
For all variables $u$ other than $\{x, y, z\}$ we rewrite: $q \leadsto q[\emptyset/u]$. Since $|Var(q)| > 3$, we know that there is at least one such variable $u$. The result of these rewrites is query $h_1^*$, which is hard, meaning that $q$ is not final.

Therefore, any final query has exactly 3 variables. □ 

Proof (Theorem 4.13). From Lemma D.2, since $q$ is final, it has exactly 3 variables and it is not weakly linear. Let $x$, $y$, $z$ be the 3 variables. An atom in $q$ can be in one of the following 7 forms:

$A(x)$, $B(y)$, $C(z)$, $R(x, y)$, $S(y, z)$, $T(x, z)$, $W(x, y, z)$

Note that because of the third rewrite rule, and since $q$ is final, we only need to consider queries with at most one atom of each type (e.g. a query containing $R_1(x, y)$ and $R_2(x, y)$ cannot be final). There are 7 possible atom types, and therefore $2^7 - 1 = 127$ possibilities for queries. We do not need to analyze all of those, as most of them are trivially weakly linear (queries with fewer than 3 atoms, or fewer than 3 variables) and therefore cannot be final.
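The count of 127 is the number of non-empty subsets of the 7 atom types. The sketch below enumerates them and applies the two trivial exclusion criteria named above (fewer than 3 atoms, or fewer than 3 variables). It ignores the endogenous/exogenous status of atoms, and the names `ATOMS`, `candidate_queries` and `trivially_excluded` are hypothetical.

```python
from itertools import combinations

# The 7 atom types and the variables each one uses.
ATOMS = {"A": "x", "B": "y", "C": "z", "R": "xy", "S": "yz", "T": "xz", "W": "xyz"}


def candidate_queries():
    """Yield every non-empty combination of the 7 atom types."""
    names = sorted(ATOMS)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            yield combo


def trivially_excluded(combo):
    """Fewer than 3 atoms or fewer than 3 variables."""
    variables = set("".join(ATOMS[g] for g in combo))
    return len(combo) < 3 or len(variables) < 3
```

The remaining combinations are the ones that the case analysis has to cover, once the endogenous or exogenous status of each atom is also taken into account.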

We will show that out of all the possible queries made up as combinations of the 7 basic atoms, only $h_1^*$, $h_2^*$, $h_3^*$ are final. The rest are either weakly linear or can be rewritten into the hard queries (and therefore are not final). Note that any query $q$ whose atoms are a subset of the atoms of $h_1^*$ or $h_2^*$ is linear. Therefore, we only need to check supersets of $h_1^*$ and $h_2^*$, and subsets of $h_3^*$. Any query where $A$, $B$, $C$ are all endogenous, or $R$, $S$, $T$ are all endogenous, is covered by these cases. We finally need to examine queries where at least one unary and one binary atom appear as exogenous. For simplicity, from now on we will drop the variable names and just use the relation symbols, in direct correspondence to the atom types mentioned above (e.g. we write $A$ instead of $A(x)$ and $R$ instead of $R(x, y)$). Also, if the endogenous or exogenous state is not explicit, it is assumed that the atom can be in either state.

Case 1: supersets of $h_1^* \text{ :- } A^n, B^n, C^n, W$. The only possible relations that can be added to $h_1^*$ from the possible types are the binary relations $R$, $S$ and $T$. For any of them that are added to $h_1^*$, there exists a singleton relation ($A$, $B$ or $C$) with a subset of their variables. Therefore, we can apply the third rewrite, e.g. $q \leadsto q - \{R\}$, to get $h_1^*$. Therefore, any query $q$ over variables $x$, $y$, $z$ that is a superset of $h_1^*$ is not final because it can be rewritten to $h_1^*$.

Case 2: supersets of $h_2^* \text{ :- } R^n, S^n, T^n$. The possible atom types that could be added are the ternary relation $W$, or the unary atoms $A$, $B$, and $C$. There are 5 possible cases excluding symmetries (adding $A$ to $h_2^*$ is equivalent to adding $B$ instead). An atom in parentheses means that it may or may not be part of $q$.

(a) $q \text{ :- } (A^x)(B^x)(C^x) R^n, S^n, T^n (, W)$: $A$, $B$, $C$ (and $W$) can be eliminated based on the third rewrite, leading to $h_2^*$.

(b) $q \text{ :- } A^n (B^x)(C^x) R, S, T (, W)$: weakly linear, as $A$ dominates $R$ and $T$ (and $W$), and $B^x$ and $C^x$ can be dissociated to $W$.

(c) $q \text{ :- } A^n B^n (C^x) R, S, T (, W)$: weakly linear, as $R$, $S$, $T$ (and $W$) are dominated, and $C^x$ can be dissociated to $W$.

(d) $q \text{ :- } A^n B^n C^n R, S, T$: this is $h_3^*$.

(e) $q \text{ :- } A^n B^n C^n R, S, T, W$: $W$ can be eliminated based on the third rewrite, leading to $h_3^*$.

Case 3: subsets of $h_3^* \text{ :- } A^n, B^n, C^n, R, S, T$. We only need to consider those where at least one of the $R$, $S$, $T$ atoms is exogenous or missing; otherwise they would fall under case 2. We have the following cases (excluding symmetries):

(a) q :— A, B,C, R, S: linear, and therefore any of its subsets are 
also linear. 

(b) q:- B'',C'',R,S,T'': weakly linear as R,S,T are domi- 
nated and can dissociate to W. 

(c) q-.-C, R, S^T"": T dissociates to W resulting in a linear 
query. Any subsets would also be linear. 

Case 4: at least one exogenous unary and one exogenous binary relation.

(a) $q \text{ :- } A^x, B, C, R^x, S, T, W$: $A$ and $R$ dissociate to $W$, resulting in a linear query. Any subset is also weakly linear.

(b) $q \text{ :- } A^x, B, C, R, S^x, T, W$: $A$ and $S$ dissociate to $W$, resulting in a linear query. Any subset is also weakly linear.

Therefore, we have shown that any final query has to be one of $h_1^*$, $h_2^*$, $h_3^*$. □

E. OTHER RESPONSIBILITY PROOFS 

Proof (Theorem 4.15). We will show the result through a series of reductions. We start with a known LOGSPACE-complete problem, the Undirected Graph Accessibility Problem (UGAP): given an undirected graph $G = (V, E)$ and two nodes $a, b \in V$, decide whether there exists a path from $a$ to $b$.

BGAP reduction: We define the Bipartite Graph Accessibility Problem (BGAP): given a bipartite graph $(X, Y, E)$ and two nodes $a \in X$, $b \in Y$, decide whether there exists a path from $a$ to $b$. Here the path is allowed to traverse edges in both directions, from $X$ to $Y$ and from $Y$ to $X$. We will reduce any instance of UGAP to an instance of BGAP as follows:

Given an instance of UGAP as an undirected graph $G = (V, E)$ and nodes $a, b \in V$, construct a bipartite graph with $X = V$, $Y = E \cup \{c\}$, where $c$ is a new node, and edges are of the form $(x, (x, y))$ and $(y, (x, y))$, plus one edge $(b, c)$. Then there exists a path $a \to b$ in $G$ iff there exists a path $a \to c$ in the bipartite graph. Therefore, BGAP is hard for LOGSPACE.
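The UGAP-to-BGAP construction can be sketched as follows, together with a plain breadth-first reachability check to compare the two instances. The function names and the tagging of edge nodes as `("edge", e)` are assumptions made for illustration; the sketch also assumes that vertex names never collide with those tags.

```python
from collections import deque


def ugap_to_bgap(V, E, b):
    """Return (X, Y, edges, c): X = V, Y = E plus a new node c."""
    c = ("new", "c")
    X = set(V)
    Y = {("edge", e) for e in E} | {c}
    edges = set()
    for (x, y) in E:
        # Each original edge becomes a Y-node linked to both of its endpoints.
        edges.add((x, ("edge", (x, y))))
        edges.add((y, ("edge", (x, y))))
    edges.add((b, c))
    return X, Y, edges, c


def reachable(edges, src, dst):
    """Undirected reachability by breadth-first search."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return dst in seen
```

On any input, reachability from $a$ to $b$ in the original graph should agree with reachability from $a$ to the new node $c$ in the bipartite graph.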

FPMF reduction: We define the Four-Partite Max-Flow problem (FPMF): given a four-partite network $(U, X, Y, V, E)$ where each edge capacity is either 1 or 2, source and target nodes $s$ and $t$ connected to all nodes in $U$ and $V$ respectively with infinite capacities, and a number $k$, decide whether the max-flow is $\ge k$. We reduce BGAP to FPMF as follows:

Given an instance of BGAP as a bipartite graph $(X, Y, E)$ and two nodes $a \in X$, $b \in Y$, construct a 4-partite graph $(U, X, Y, V, E')$ by leaving the $X$ and $Y$ partitions and the edges between them unchanged, as they are in the BGAP instance, and set their capacities to 2. Create a $U$-node $xy$ for each edge $(x, y) \in E$. Each node $xy \in U$ is connected to $x \in X$ with an edge of capacity 1. Symmetrically, the $V$-nodes are $E$, and each node $y \in Y$ is connected to all nodes $xy \in V$ with capacity 1. Finally, connect a source node $s$ to all nodes in $U$ with infinite capacity, and connect all nodes in $V$ to a target node $t$, also with infinite capacity. The resulting graph has a maximum flow (min-cut) equal to $|E|$: the number of edges between any 2 partitions is exactly equal to $|E|$, and edges between the $X$ and $Y$ partitions are not chosen in a minimum cut, as they have capacity 2 instead of 1. The maximum flow of $|E|$ utilizes all $U$-$X$ and $Y$-$V$ edges, and a residual flow of 1 is left in all $X$-$Y$ edges.

Now add to the graph a new node $a'$ in partition $U$, connected with capacity 1 to node $a$ in $X$ and with infinite capacity to the source node. Also add a node $b'$ to partition $V$, connected to node $b$ of partition $Y$ with capacity 1, and to the target node with infinite capacity. The flow in this final graph is $|E|$ iff there is no path between $a$ and $b$ in the BGAP instance, and it is $|E| + 1$ iff there is a path between $a$ and $b$. Therefore, BGAP can be solved by computing the maximum flow in the FPMF instance with $k = |E| + 1$.
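The construction can be checked on small instances with any max-flow routine. The sketch below uses a plain Edmonds-Karp implementation, which is a choice made here and not something the proof prescribes; the node tags (`"U"`, `"X"`, `"Y"`, `"V"`, `"a'"`, `"b'"`) and function names are likewise illustrative assumptions.

```python
from collections import defaultdict, deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp on a residual-capacity map cap[u][v]; mutates cap."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def bgap_to_fpmf(E, a, b):
    """Build the capacity map of the four-partite network for a BGAP instance."""
    cap = defaultdict(lambda: defaultdict(int))
    for (x, y) in E:
        cap["s"][("U", x, y)] = INF
        cap[("U", x, y)][("X", x)] = 1   # U-node xy to x
        cap[("X", x)][("Y", y)] = 2      # original bipartite edge
        cap[("Y", y)][("V", x, y)] = 1   # y to V-node xy
        cap[("V", x, y)]["t"] = INF
    # Extra nodes a' and b' that can carry one more unit of flow.
    cap["s"]["a'"] = INF
    cap["a'"][("X", a)] = 1
    cap[("Y", b)]["b'"] = 1
    cap["b'"]["t"] = INF
    return cap
```

Every source-to-target path crosses at least one capacity-1 edge, so the bottleneck in `max_flow` is always finite for networks built this way.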

Query reduction: We will reduce FPMF to computing responsibility for query $q$. Let $(X, Y, Z, W, E)$ and number $k$ be an instance of FPMF. For each $(x_i, y_j)$ edge between partitions $X$ and $Y$, create a tuple $R(x_i, 1, y_j)$ if the capacity of the edge is 1, and two tuples $R(x_i, 1, y_j)$ and $R(x_i, 2, y_j)$ if the capacity of the edge is 2. Similarly create relation $S(y, u_2, z)$ based on the $Y$-$Z$ edges, and relation $T(z, u_3, w)$ based on the $Z$-$W$ edges. Finally, add tuples $R(x_0, 1, y_0)$, $S(y_0, 1, z_0)$, and $T(z_0, 1, w_0)$, where $x_0$, $y_0$, $z_0$ and $w_0$ are unique new values in the respective domains. The max-flow in the FPMF instance is $\ge k$ iff the responsibility of $R(x_0, 1, y_0)$ is $\ge k$. Therefore, responsibility for $q$ is hard for LOGSPACE. □

Proof (Prop. 4.16, Self-Joins). This results from a reduction from vertex cover. Given graph $G(V, E)$ as an instance of a vertex cover problem, we construct relations $R$ and $S$ as follows:

• For each vertex $v_i \in V$, create a new tuple $r_i$ with unique attribute value $x_i$.

• For each edge $(v_i, v_j) \in E$, create a new tuple $s_k$ with values $(x, y) = (x_i, x_j)$, where $x_i$, $x_j$ are the values of tuples $r_i$, $r_j$ that correspond to nodes $v_i$ and $v_j$ respectively.

• Add tuples $r_0$ with value $x_0$ and $s_0$ with value $(x_0, x_0)$.

The above transformation is polynomial, as we create one tuple per node and one tuple per edge. A vertex cover of size $K$ in $G$ is a contingency of size $K$ for tuple $r_0$ in the database instance: removing from the database all tuples $r_i$ corresponding to the cover leaves no other join result apart from the one due to $r_0$, $s_0$. All other $s_i = (x_i, y_i) \neq s_0$ do not produce a join result, as at least one of $R(x_i)$ or $R(y_i)$ has been removed.

Now assume a contingency $\mathcal{S}$ for tuple $r_0$. If $\mathcal{S}$ contains a tuple $s_i = (x_i, y_i)$, then we can construct a new contingency $\mathcal{S}' = (\mathcal{S} \setminus \{s_i\}) \cup \{R(x_i)\}$, and $|\mathcal{S}'| \le |\mathcal{S}|$. Therefore, there exists a minimum contingency $\mathcal{S}$ that contains only $R$-tuples. If $V'$ is the set of nodes that corresponds to tuples $r_i \in \mathcal{S}$, then $V'$ is a vertex cover in $G$. If there was an edge left uncovered, then there would be a tuple $S(x_i, y_i)$ such that neither of $R(x_i)$, $R(y_i)$ is in the contingency, which is a contradiction, as the join tuple $R(x_i), S(x_i, y_i), R(y_i)$ would then be in the result. The cover $V'$ is minimal, because $\mathcal{S}$ is minimal. □
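The correspondence between minimum vertex covers and minimum contingencies can be checked by brute force on small graphs. The sketch below assumes the self-join query has the form q :- R(x), S(x, y), R(y), as suggested by the join tuple in the proof; the function names and the fresh value `"x0"` are assumptions made for illustration, and the exhaustive search is only meant for tiny instances.

```python
from itertools import combinations

X0 = "x0"  # fresh value for r0 and s0; assumed not to be a vertex name


def build_instance(vertices, edges):
    """One R-tuple per vertex, one S-tuple per edge, plus r0 and s0."""
    R = set(vertices) | {X0}
    S = set(edges) | {(X0, X0)}
    return R, S


def query_true(R, S):
    """Evaluate q :- R(x), S(x, y), R(y)."""
    return any(x in R and y in R for (x, y) in S)


def min_contingency_size(R, S, target=X0):
    """Smallest set of other tuples whose removal makes R(target) counterfactual."""
    candidates = [("R", v) for v in R if v != target] + [("S", e) for e in S]
    for k in range(len(candidates) + 1):
        for gamma in combinations(candidates, k):
            R2 = R - {v for rel, v in gamma if rel == "R"}
            S2 = S - {e for rel, e in gamma if rel == "S"}
            # Counterfactual: q holds with the target and fails without it.
            if query_true(R2, S2) and not query_true(R2 - {target}, S2):
                return k
    return None


def min_vertex_cover_size(vertices, edges):
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            if all(u in cover or v in cover for (u, v) in edges):
                return k
```

On a triangle and on a star with two leaves, the two quantities coincide, as the proof predicts.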

Proof (Theorem 4.17, Why-No Responsibility). This is a straightforward result based on the observation that the contingency set of a non-answer is bounded by the query size, and is therefore irrelevant to data complexity. In order to make a tuple counterfactual, we need to insert at most $m - 1$ tuples into the database, where $m$ is the number of query subgoals. □



